METHOD FOR CONSTRUCTING PREFERRED VIEWS OF
HIERARCHICAL DATA
Copyright Notice 

This patent specification contains material that is subject to copyright protection. 
The copyright owner has no objection to the reproduction of this patent specification or 
related materials from associated patent office files for the purposes of review, but 
otherwise reserves all copyright whatsoever. 

Technical Field of the Invention 
The present invention relates to the general field of information retrieval and, in 
particular, to the automatic identification and retrieval of relevant data from large 
hierarchical data sources. 

Background 

Extensible Markup Language (XML) is becoming an increasingly popular
hierarchical format for storing and exchanging information. Whilst the hierarchical nature 
of XML makes it an excellent means for capturing relationships between data objects, it 
also makes keyword searching more difficult. 

Keyword searching is of particular importance when dealing with a structured data 
format such as XML because it allows the user to locate particular keywords quickly 
without the need to know the internal structure of the data. It is a challenge when working 
with XML because there is no optimal or clearly preferred method for presenting the result 
of a keyword search. In the traditional unstructured text environment, the data system 
typically presents the user with the located keywords together with other text in their 
vicinity. If there is more than one 'hit', the neighbouring text provides a useful
context for distinguishing between hits, thereby allowing the user to quickly select the
most relevant hit according to the user's needs.

In the structured XML environment, there is no clear concept of 'neighbouring 
data' since data that are related to one another may reside at several disjoint locations 
within an XML document. Thus it is difficult to identify or construct a suitable context for 
a hit in a keyword search. Consequently, most existing XML-based systems simply return
an entire XML document (out of a collection of XML documents) if a keyword hit occurs 
within the document, with the entire document effectively serving as the context for the 
hit. This is undesirable when documents are large and the user is not interested in seeing 
all of their contents. 

Practical data sources, especially databases, often contain much more data than a
user typically wants to see at any one time. For example, a database in a mail order store
may contain details about all of its product lines, customers, suppliers, couriers, and lists
of past and pending orders. A store clerk may at one time wish to see the current stock
level for a particular product, and at another time may want to check the status of an order
for a customer. A store manager on the other hand may wish to see the variation of the
total sales for a particular product line over a number of months. In each of these cases it
would be too distracting to the user if an avalanche of additional irrelevant data were
also presented. Further, unless the user is familiar with the structure of the database, the
user would typically be unable to locate the information in which the user has an interest.

The traditional method for providing only relevant data is through the use of
pre-created "views", prepared by someone who is familiar with the structure of the data
source, such as a system administrator. Each view draws together some subset of the data
source and is tailored for a distinct purpose. In the previously given examples, the store
clerk would consult a "stock level" view or an "order status" view, whilst the manager
would bring up a "sales" view.

Whilst this approach of using pre-created views may be satisfactory when all likely 
usage scenarios can be anticipated, it is inadequate for keyword searching. In a keyword 
search operation, a user enters one or more keywords and the system responds with a data 
set or view that includes occurrences of all keywords (assuming an AND Boolean keyword
search operation). In a hierarchical environment such as XML, keyword hits may occur in
several data items residing at different locations in the hierarchy. Since it is not feasible to
anticipate all possible keyword combinations that a user may provide, it is not possible to
pre-determine where in the hierarchy hits will occur. Consequently it is not possible to
provide pre-created views that will cater for all search scenarios.

An analogous keyword searching problem also exists in the relational database 
environment. A relational database comprises tables joined through their primary and 
foreign keys, where each table comprises a plurality of rows each denoting an n-tuple of 
attribute values for some entity. A traditional solution to keyword searching in a relational
database, described by Hristidis, V. and Papakonstantinou, Y., "DISCOVER: Keyword
Search in Relational Databases", Proceedings of the 28th VLDB Conference, 2002, is to
return a minimal joining network, which is the smallest network of joined rows across
joined tables that contains all keyword hits. A problem with this approach is that it
effectively treats rows as the smallest data "chunks", in that if a keyword hit occurs anywhere
in a row of a database table then the entire row is returned as context for the hit.
This may lead to excessive amounts of data being presented to the user since a typical 
relational database table often contains many columns that are not usually of interest to the 
user. 

Further, adapting the above technique to hierarchical data structures such as XML
may result in insufficient context information. In a hierarchical environment, related data
may be stored at different levels in the hierarchy, and thus often data stored in a parent or
ancestor node or their children may provide very useful context for a keyword hit, even
though these may not be included in the minimal data set.

Some attempts have been made to address the keyword searching problem in
hierarchical data. Florescu, D. et al, "Integrating Keyword Search into XML Query
Processing", Ninth International World Wide Web Conference, May 2000, discloses a
method of augmenting a structural query language with a keyword searching operator,
"contains". This operator evaluates to TRUE if a specified sub-tree contains some specified
keywords. The user can use this operator when constructing queries to filter out unwanted
data. Whilst this useful feature does not require the user to specify the exact location of hit
keywords within a given sub-tree, it does not go far enough since the user is still required
to specify the exact format of the returned data in the search query and hence the user
would still need to be familiar with the structure of the data source. In other words,
free-text keyword searching is still not possible, unless the user is willing to accept an entire
data source as a result of the search.

Another existing approach to keyword searching in an XML data source requires
the user to select, from a given list of schema elements, the element representing the root
node of the returned data. If a keyword hit occurs in a descendant node of a data element
represented by the selected schema element, then the entire sub-tree below the data
element, containing the hit keyword, is returned to the user. This approach is cumbersome
because it requires user intervention. Furthermore, the user is forced to accept an entire
sub-tree even though it may contain data not of interest to the user.

Accordingly, there is a need for a method for determining a set of relevant data in a
hierarchical data environment in response to a keyword search operation involving
arbitrary combinations of keywords that does not require user intervention or prior user
knowledge of the structure of the hierarchical data.

Summary 

It is an object of the present invention to substantially overcome, or at least
ameliorate, one or more disadvantages of existing methods.

In accordance with one aspect of the present invention, there is disclosed a method
of constructing at least one hierarchical data structure from at least one hierarchical data
source, said method comprising the steps of:

(i) constructing a representation of said at least one hierarchical data source and
at least one previous view of said at least one hierarchical data source;

(ii) identifying at least one compulsory entity in said representation; and

(iii) constructing said at least one hierarchical data structure comprising said at
least one compulsory entity and one or more context entities, where said context entities
are obtained from said representation and context data obtained from said at least one
previous view.

In accordance with another aspect of the present invention, there is disclosed a
method of selecting data from a hierarchically-structured data source, said method
comprising the steps of:

(i) forming a graphical representation of said hierarchically-structured data
source;

(ii) detecting a user selection of part of said representation; and

(iii) selecting a set of descendant hierarchical components in said user-selected
part based on an occurrence probability of said set of hierarchical components given a root
component of said user-selected part.

In accordance with another aspect of the present invention, there is disclosed a
method of construction and presentation of data for a keyword searching operation in at
least one hierarchical data source involving at least one search keyword, said method
comprising the steps of:

(i) constructing a graphical representation of said at least one hierarchical data
source and at least one previous view of said at least one hierarchical data source;

(ii) identifying at least one compulsory entity in said graphical representation,
where said compulsory entity is a node in said graphical representation representing a
location of one or more of said at least one search keyword;

(iii) constructing at least one hierarchical data structure comprising said at least
one compulsory entity and one or more context entities, where said context entities are
obtained from said graphical representation and context data obtained from said at least one
previous view; and

(iv) presenting said at least one hierarchical data structure as a result of said
keyword searching operation.

In accordance with another aspect of the present invention, there is disclosed a
method of presentation of data sourced from a sub-tree of hierarchically-presented data,
said method comprising the steps of:

(i) selecting a set of descendant nodes in said sub-tree based on context data
obtained from at least one previous presentation of said hierarchically-presented data; and

(ii) constructing and presenting a hierarchical data structure comprising a root
node of said sub-tree and said selected set of descendant nodes.

Other aspects of the present invention, including apparatus and computer media,
are also disclosed.

Brief Description of the Drawings 

At least one embodiment of the present invention will now be described with
reference to the drawings in which:

Fig. 1 is an example schema graph;

Fig. 2 is a flowchart of a keyword searching method;

Figs. 3A and 3B show two example parent nodes in a schema graph;

Fig. 4 is a diagram of a network of server and client computers;

Fig. 5 is a flowchart of a method for identifying context nodes among a set of child
nodes of a parent node not lying along a directed path from the root node to a hit node;

Fig. 6 is a flowchart of another method for identifying context nodes among a set
of child nodes of a parent node not lying along a directed path from the root node to a hit
node;

Fig. 7 is a flowchart of a method for identifying context nodes among a set of child
nodes of a parent node lying along a directed path from the root node to a hit node;

Fig. 8 is a flowchart of another method for identifying context nodes among a set
of child nodes of a parent node lying along a directed path from the root node to a hit
node;

Fig. 9 is an example schema graph with two identical sub-trees;

Fig. 10 is a flowchart of the first, bottom-up traversal phase of the context node
identification method with probability averaging;

Fig. 11 is an example of a schema graph with multiple hit nodes;

Fig. 12 is a flowchart of the first, bottom-up traversal phase of the context node
identification method with probability averaging and involving multiple hit nodes;

Fig. 13 is a flowchart of a method for identifying context nodes among a set of
child nodes of a parent node not lying along a directed path from the root node to a hit
node, with probability averaging;

Fig. 14 is a flowchart of a method for identifying context nodes among a set of
child nodes of a parent node lying along a directed path from the root node to a hit node,
with probability averaging;

Fig. 15 is an example of a parent node whose descendant hit nodes are all located
under a single child node;

Fig. 16 is a flowchart of a method for identifying context nodes among a set of
child nodes of a parent node lying along a directed path from the root node to a hit node,
with probability averaging, for the case where all multiple hit nodes are located under a
single child node;

Fig. 17 is a flowchart of a method for identifying context nodes among a set of
child nodes of a parent node lying along a directed path from the root node to a hit node,
with probability averaging, for the case where hit nodes are located under multiple child
nodes;

Fig. 18 is a flowchart of a method for identifying context nodes in which one or
multiple hit nodes may be present;

Fig. 19 is a flowchart of a method for constructing context trees for cases involving
a single hit node;

Fig. 20 is a flowchart of a method for constructing context trees for cases involving
multiple hit nodes;

Fig. 21 is a flowchart of a method for constructing an alternative set of hit nodes
that have higher observation frequencies than those in an original set of hit nodes;

Fig. 22 is a flowchart of a method for selecting an ancestor of a set of hit nodes that
has a higher observation frequency than the set of hit nodes;

Fig. 23 is an example schema graph;

Fig. 24 is a schema graph of an example data view;

Fig. 25 is a schema graph of another example data view;

Fig. 26 is a schema graph of yet another example data view;

Fig. 27 is an occurrence frequency table arising from the data views in Figs. 24, 25
and 26;

Fig. 28 is a co-occurrence frequency table arising from the data views in
Figs. 24, 25 and 26;

Fig. 29 is a leaf co-occurrence frequency table arising from the data views in
Figs. 24, 25 and 26;

Fig. 30 is a sole child co-occurrence frequency table arising from the data views in
Figs. 24, 25 and 26;

Fig. 31 is a portion of a joint-occurrence frequency table arising from the data
views in Figs. 24, 25 and 26;

Fig. 32 is another portion of a joint-occurrence frequency table arising from the
data views in Figs. 24, 25 and 26;

Fig. 33 is yet another portion of a joint-occurrence frequency table arising from the
data views in Figs. 24, 25 and 26;

Fig. 34 is yet another portion of a joint-occurrence frequency table arising from the
data views in Figs. 24, 25 and 26;

Fig. 35 is yet another portion of a joint-occurrence frequency table arising from the
data views in Figs. 24, 25 and 26;

Fig. 36 is the schema graph of a context tree returned as a result of a keyword
search operation involving two keywords;

Fig. 37 is a schematic block diagram of a general purpose computer upon which
the arrangements described may be practiced;

Fig. 38 is a flowchart of a sub-process within the method for constructing context
trees for cases involving a single hit node depicted in Fig. 19; and

Fig. 39 is a flowchart of a sub-process within the method for constructing context
trees for cases involving multiple hit nodes depicted in Fig. 20.

Detailed Description including Best Mode

The present disclosure provides a method for determining a set of relevant data in a
hierarchical data environment in response to a keyword search operation involving one or
more keywords. A preferred implementation includes a Bayesian probabilistic
method that constructs preferred views of data in a hierarchical data structure based on
how data was accessed in past episodes. More specifically, the method makes use of the
frequencies of past joint-occurrences between pairs and vectors of data items to compute
the probability that a data item is relevant to some other compulsory data items. Typically,
the compulsory data items are those containing keyword hits, and thus must be returned to
the user in the keyword search results. If a non-compulsory data item has a high
probability of being relevant to a compulsory data item, then it is likely to be returned in
the search results to serve as context for the keyword hits.
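
By way of illustration only, the use of past joint-occurrence frequencies can be sketched as follows. The sketch assumes that each past view is available as a set of node names, and it uses the plain conditional frequency (views containing both items divided by views containing the compulsory item) as the relevance estimate. The function names `cooccurrence_counts` and `relevance`, the sample node names and this particular estimate are assumptions made for the example; they are not the specific computation of the preferred implementation.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(past_views):
    """Count single occurrences and pairwise co-occurrences over past views.

    past_views: iterable of sets of node names (a hypothetical encoding of
    the data items each past view contained).
    """
    occ = Counter()  # node -> number of views containing it
    co = Counter()   # (node_a, node_b) with node_a < node_b -> number of views
    for view in past_views:
        items = set(view)
        occ.update(items)
        for a, b in combinations(sorted(items), 2):
            co[(a, b)] += 1
    return occ, co


def relevance(candidate, compulsory, occ, co):
    """Estimate P(candidate appears | compulsory appears) from the counts."""
    if occ[compulsory] == 0:
        return 0.0  # the compulsory item was never observed in a past view
    pair = tuple(sorted((candidate, compulsory)))
    return co[pair] / occ[compulsory]


# Hypothetical past views from the mail order example.
past_views = [
    {"order", "customer", "status"},
    {"order", "status"},
    {"product", "stock"},
]
occ, co = cooccurrence_counts(past_views)
status_given_order = relevance("status", "order", occ, co)
customer_given_order = relevance("customer", "order", occ, co)
```

Under these assumptions a non-compulsory item such as "status" would be a stronger candidate for context than "customer" when "order" contains a keyword hit, because it accompanied "order" in more of the past views.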

A distinguishing feature of the presently disclosed arrangements with respect to 
traditional pre-created view based approaches is that the former is able to synthesise new 
views, rather than merely returning an existing stored view. Such arrangements are thus
able to handle keyword search operations involving arbitrary keyword combinations, and
since views are dynamically generated, they can be better tailored to individual operations 
than those obtained from a fixed pool of pre-created views. 

The presently disclosed methods typically construct a number of alternative views
and assign a score to each view, signifying how much the view may be of interest to the
user. In one implementation, a single view that has the highest score among those 
constructed is returned to the user. In an alternative implementation, a list of views is 
returned, sorted according to their scores, from highest to lowest. 
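
The two ways of returning results described above can be sketched as follows. This is a minimal illustration that assumes each candidate view has already been given a numeric score; the function name `rank_views`, the `best_only` flag and the sample scores are hypothetical and are not taken from the specification.

```python
def rank_views(scored_views, best_only=False):
    """Order (view, score) pairs from highest to lowest score.

    With best_only=True only the single highest-scoring view is returned,
    otherwise the full sorted list is returned.
    """
    ranked = sorted(scored_views, key=lambda pair: pair[1], reverse=True)
    return ranked[:1] if best_only else ranked


# Hypothetical candidate views with made-up scores.
candidates = [("stock level", 0.4), ("order status", 0.9), ("sales", 0.1)]
single_result = rank_views(candidates, best_only=True)
result_list = rank_views(candidates)
```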

Although keyword searching is their primary motivation, the presently disclosed
methods can also be used to enhance a method of presentation of hierarchical data, such as
that described in Australian Patent Application No. 2003204824 filed 19 June 2003 and
corresponding United States Patent Application No. 10/465,222 filed 20 June 2003, both
entitled "Methods for Interactively Defining Transforms and for Generating Queries by
Manipulating Existing Query Data". In that publication, a method for selecting the most
appropriate presentation type (such as tables, graphs, plots, trees, etc.) based on the
structure and contents of a hierarchical data source is disclosed. That method can be
enhanced by incorporating a preferred implementation of the present disclosure as a means
for automatically selecting a most preferred subset of the data source for display, prior to
the selection of presentation type. It is often useful to display only a preferred subset of
data in this way since hierarchical data sources often contain more information than
would normally be of interest to the user, and hence a method for filtering out
'uninteresting' data such as the preferred embodiment of the present invention can help to
make the user's experience more satisfying and productive.

Some portions of the description which follows are explicitly or implicitly
presented in terms of algorithms and symbolic representations of operations on data within
a computer memory. These algorithmic descriptions and representations are the means
used by those skilled in the data processing arts to most effectively convey the substance
of their work to others skilled in the art. An algorithm is here, and generally, conceived to
be a self-consistent sequence of steps leading to a desired result. The steps are those
requiring physical manipulations of physical quantities. Usually, though not necessarily,
these quantities take the form of electrical or magnetic signals capable of being stored,
transferred, combined, compared, and otherwise manipulated. It has proven convenient at
times, principally for reasons of common usage, to refer to these signals as bits, values,
elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that the above and similar terms are to be
associated with the appropriate physical quantities and are merely convenient labels
applied to these quantities. Unless specifically stated otherwise, and as apparent from the
following, it will be appreciated that throughout the present specification, discussions
utilizing terms such as "scanning", "calculating", "determining", "replacing", "generating",
"initializing", "outputting", or the like, refer to the action and processes of a computer
system, or similar electronic device, that manipulates and transforms data represented as
physical (electronic) quantities within the registers and memories of the computer system
into other data similarly represented as physical quantities within the computer system
memories or registers or other such information storage, transmission or display devices.

The present specification also discloses apparatus for performing the operations of
the methods. Such apparatus may be specially constructed for the required purposes, or
may comprise a general purpose computer or other device selectively activated or
reconfigured by a computer program stored in the computer. The algorithms and displays
presented herein are not inherently related to any particular computer or other apparatus.
Various general purpose machines may be used with programs in accordance with the
teachings herein. Alternatively, the construction of more specialized apparatus to perform
the required method steps may be appropriate. The structure of a conventional general
purpose computer will appear from the description below.

In addition, the present specification also discloses a computer readable medium
comprising a computer program for performing the operations of the methods. The
computer readable medium is taken herein to include any transmission medium for
communicating the computer program between a source and a destination. The
transmission medium may include storage devices such as magnetic or optical disks,
memory chips, or other storage devices suitable for interfacing with a general purpose
computer. The transmission medium may also include a hard-wired medium such as
exemplified in the Internet system, or wireless medium such as exemplified in the GSM
mobile telephone system. The computer program is not intended to be limited to any
particular programming language and implementation thereof. It will be appreciated that a
variety of programming languages and coding thereof may be used to implement the
teachings of the disclosure contained herein.

Where reference is made in any one or more of the accompanying drawings to
steps and/or features, which have the same reference numerals, those steps and/or features
have for the purposes of this description the same function(s) or operation(s), unless the
contrary intention appears.

The methods of keyword searching in general, and hierarchical data structure
construction in particular, are preferably practiced using a general-purpose computer
system 3700, such as that shown in Fig. 37, wherein the processes of Figs. 1 to 36 may be
implemented as software, such as an application program executing within the computer 
system 3700. In particular, the steps of keyword searching are effected by instructions in 
the software that are carried out by the computer. The instructions may be formed as one 
or more code modules, each for performing one or more particular tasks. The software 
may also be divided into two separate parts, in which a first part performs the searching 
methods and a second part manages a user interface between the first part and the user. 
The software may then be stored in a computer readable medium, including the storage 
devices described below, for example. The software is loaded into the computer from the 
computer readable medium, and then executed by the computer. A computer readable 
medium having such software or computer program recorded on it is a computer program 
product. The use of the computer program product in the computer preferably effects an 
advantageous apparatus for keyword searching and hierarchical data structure 
construction. 

The computer system 3700 is formed by a computer module 3701, input devices 
such as a keyboard 3702 and mouse 3703, output devices including a printer 3715, a 
display device 3714 and loudspeakers 3717. A Modulator-Demodulator (Modem) 
transceiver device 3716 is used by the computer module 3701 for communicating to and 
from a communications network 3720, for example connectable via a telephone line 3721
or other functional medium. The modem 3716 can be used to obtain access to the Internet, 
and other network systems, such as a Local Area Network (LAN) or a Wide Area Network 
(WAN), and may be incorporated into the computer module 3701 in some 
implementations.

The computer module 3701 typically includes at least one processor unit 3705, and
a memory unit 3706, for example formed from semiconductor random access memory
(RAM) and read only memory (ROM). The module 3701 also includes a number of
input/output (I/O) interfaces including an audio-video interface 3707 that couples to the
video display 3714 and loudspeakers 3717, an I/O interface 3713 for the keyboard 3702
and mouse 3703 and optionally a joystick (not illustrated), and an interface 3708 for the
modem 3716 and printer 3715. In some implementations, the modem 3716 may be
incorporated within the computer module 3701, for example within the interface 3708. A
storage device 3709 is provided and typically includes a hard disk drive 3710 and a floppy
disk drive 3711. A magnetic tape drive (not illustrated) may also be used. A CD-ROM
drive 3712 is typically provided as a non-volatile source of data. The components 3705
to 3713 of the computer module 3701 typically communicate via an interconnected
bus 3704 and in a manner which results in a conventional mode of operation of the
computer system 3700 known to those in the relevant art. Examples of computers on
which the described arrangements can be practised include IBM-PCs and compatibles,
Sun Sparcstations or like computer systems evolved therefrom.

Typically, the application program is resident on the hard disk drive 3710 and read
and controlled in its execution by the processor 3705. Intermediate storage of the program
and any data fetched from the network 3720 may be accomplished using the
semiconductor memory 3706, possibly in concert with the hard disk drive 3710. In some
instances, the application program may be supplied to the user encoded on a CD-ROM or
floppy disk and read via the corresponding drive 3712 or 3711, or alternatively may be
read by the user from the network 3720 via the modem device 3716. Still further, the
software can also be loaded into the computer system 3700 from other computer readable
media. The term "computer readable medium" as used herein refers to any storage or
transmission medium that participates in providing instructions and/or data to the
computer system 3700 for execution and/or processing. Examples of storage media
include floppy disks, magnetic tape, CD-ROM, a hard disk drive, a ROM or integrated
circuit, a magneto-optical disk, or a computer readable card such as a PCMCIA card and
the like, whether or not such devices are internal or external of the computer module 3701.
Examples of transmission media include radio or infra-red transmission channels as well
as a network connection to another computer or networked device, and the Internet or
Intranets including e-mail transmissions and information recorded on Websites and the
like.

Keyword searching in a hierarchical environment comprises identifying the nodes 
or elements in the hierarchical data structure where the keyword or keywords occur and 
then determining what other data elements are relevant to the keywords. In a typical 
keyword searching scenario, the resulting data presented to the user is a second 
hierarchical data structure extracted from the first data structure and containing all or some
of the search keywords and other data considered to be relevant to these keywords. Such a 
hierarchical data structure presented to the user as a result of the keyword search operation 
is referred to as a context tree. 

When the hierarchical data being searched has a governing schema, as is often
the case with XML, it is generally advantageous to employ a method for identifying
relevant data that operates at the schema level. That is, elements within the schema
representation are analysed to determine whether they are relevant to the search keywords.
All instances of data items in the data source collectively represented by the relevant
schema elements are then returned to the user as the result of the keyword search
operation. In XML the governing schema can be in the form of an XML Schema, which
itself is another hierarchical data structure. An XML Schema specifies the structure of the 
associated XML data, the list of elements and attributes in the XML data and their parent- 
child relationships. Since each element or attribute in an XML Schema typically 
represents many instances of elements and attributes in the XML data, an XML Schema is 
potentially a much smaller data structure and hence can be analysed more efficiently. 
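
The relationship between schema elements and the data instances they represent can be illustrated with a short sketch. It assumes, purely for the example, that the schema is approximated by the set of distinct element paths found in the data rather than read from an XML Schema document, and that a schema element counts as a keyword hit when the text of any of its instances contains the keyword. The function names `index_by_schema_path` and `hit_paths` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def index_by_schema_path(xml_text):
    """Group element instances by their path from the root.

    Each distinct path stands in for one schema element (an assumption of
    this sketch); the list under it holds every instance it represents.
    """
    root = ET.fromstring(xml_text)
    index = {}

    def walk(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        index.setdefault(path, []).append(elem)
        for child in elem:
            walk(child, path)

    walk(root, "")
    return index


def hit_paths(index, keyword):
    """Return schema paths having at least one instance whose text contains keyword."""
    kw = keyword.lower()
    return sorted(
        path
        for path, elems in index.items()
        if any(kw in (e.text or "").lower() for e in elems)
    )


# Hypothetical data source: two order instances share one schema element each.
sample = (
    "<store>"
    "<order><customer>Ann</customer><status>shipped</status></order>"
    "<order><customer>Bob</customer><status>pending</status></order>"
    "</store>"
)
index = index_by_schema_path(sample)
hits = hit_paths(index, "shipped")
```

In this sample the data contains seven element instances but only four distinct schema paths, which is the sense in which the schema-level structure is the smaller one to analyse.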

It is often desirable to search for keywords in more than one hierarchical data 
source. Although each hierarchical data source on its own is tree-structured, when 
multiple data sources are considered together the resulting data structure may take on a 
more general form. One such form that invariably arises in a database environment is 
illustrated in Fig. 1. This structure essentially comprises a number of trees with shared 
nodes, where each tree represents the schema of a distinct hierarchical data source and the 
shared nodes are the result of data views whose contents span multiple data sources. 
Specifically the dotted boxes 1005 and 1010 in Fig. 1 denote the schemas of a first and 
second data source respectively, and node 1015 is the root node of a data view that brings 
together nodes 1020 and 1025 from the first data source and node 1030 from the second 
data source. The multiple shared-tree structure in Fig. 1, referred to herein as a schema
graph, is a special form of a directed acyclic graph with an important characteristic that 
there is at most a single directed path between any two nodes. For example, there is only 
one directed path from node 1015 to node 1035 and this path passes through node 1020. 
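
The defining property of a schema graph, namely that there is at most a single directed path between any two nodes, can be checked with a short sketch. The graph is assumed to be stored as a mapping from each node to a list of its child nodes. The function name `is_schema_graph` and the two root labels `root_a` and `root_b` are hypothetical, and only the edges among nodes 1015, 1020, 1025, 1030 and 1035 follow the description of Fig. 1 above; the rest of the figure is not reproduced.

```python
def is_schema_graph(children):
    """Return True if no node is reachable from another by two distinct directed paths.

    children: dict mapping a node to a list of its child nodes.
    A traversal is started from every node; reaching any node a second time
    from the same start means two distinct paths (or a cycle) exist.
    """
    nodes = set(children) | {c for kids in children.values() for c in kids}
    for source in nodes:
        seen = {source}
        stack = [source]
        while stack:
            node = stack.pop()
            for child in children.get(node, ()):
                if child in seen:
                    return False
                seen.add(child)
                stack.append(child)
    return True


# Assumed edges: two source schemas joined by the data view rooted at 1015.
fig1_like = {
    "root_a": ["1020", "1025"],
    "root_b": ["1030"],
    "1015": ["1020", "1025", "1030"],
    "1020": ["1035"],
}
valid = is_schema_graph(fig1_like)

# Adding a direct edge 1015 -> 1035 creates a second path to 1035 via 1020.
with_shortcut = dict(fig1_like, **{"1015": ["1020", "1025", "1030", "1035"]})
invalid = is_schema_graph(with_shortcut)
```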

The schema graph is preferably constructed prior to a keyword searching operation 
and is made up of the initially disjoint individual tree-structured schemas of the 
hierarchical data sources. These schema trees are then joined when a data view is created 
that spans more than one data source. A data view typically comprises a query (such as an 
XQuery in an XML environment) and may be created by a database administrator or user. 
In either case, the database system preferably logs or records these queries in its storage 
device. During the construction of the schema graph, a schema representation of each 
logged query is created and inserted into the schema graph. This results in a joining of 
two or more separate schema trees if the schema being inserted contains nodes from these 
trees, as illustrated in Fig. 1. It is also possible that the schema being inserted into the 
schema graph contains nodes from only one data source, in which case a joining of 
separate schema trees does not arise. Instead the insertion operation simply results in new 
nodes being added and linked to existing nodes from a single schema tree in the schema 
graph. 

The schema graph may be updated continually as new queries are logged, or it may 
be updated on one or more occasions after new queries and data views have been collected 
over some period of time. Regardless of how often the schema graph is updated, when a 
keyword search operation is initiated, the schema graph current at the time of the operation 
is the one that is used to determine data views that are returned to the user. For the 
remainder of this document, the term "schema graph" refers to the schema graph that is 
current at the time a keyword search operation is performed. 

As in the case of a single data source, keyword searching within multiple 
hierarchical data sources involves first identifying nodes within the schema graph where 
the search keywords are found, referred to as "hit" nodes, and then identifying nodes that 
are relevant to the hit nodes, referred to as "context" nodes. A data structure comprising 
the hit and context nodes is then constructed and presented to the user. Since hit nodes 
can be located in more than one data source, the resulting data structure presented to the 
user may span multiple data sources. The resulting data structure is preferably also 
tree-structured since its intended applications are in hierarchical data environments. 

Fig. 4 shows a preferred configuration 4000 and generalised mode of operation of 
the keyword searching methods. The configuration 4000 comprises a PC client 4005, a 
data server 4010, a database 4015, a keyword search client 4025, and an index server 4030 
connected together in a network. 

Each of the devices 4005, 4010, 4025 and 4030 is typically formed by a 
corresponding general purpose computer system, such as the system 3700, each linked by 
the network 3720, which is only illustrated conceptually in Fig. 4. This conceptual 
illustration is used to provide for an uncluttered representation of data flows between the 
various devices 4005, 4010, 4025 and 4030, which occur across the network 3720. 
When necessary, appropriate or convenient, the various devices 4005, 4010, 4025 and 
4030 may be combined into a smaller number of distinct computer systems 3700. For 
example, in some implementations, it may be convenient to combine the servers 4010 and 
4030 into one computer system 3700, and combine the clients 4005 and 4025 into another 
computer system 3700, those systems 3700 being linked by the network 3720. 

Data stored in the database 4015 is typically accessed by a user browsing at the PC 
client 4005. A browsing application operating in the client 4005 issues commands, 
preferably in the form of XQueries 4006, which are then transmitted to the data 
server 4010. Each XQuery 4006 is recorded in a log 4020 and analysed by the data 
server 4010, after which the requested data 4007 is fetched from the database 4015 and 
delivered to the user at the PC client 4005. At some point in time, preferably after a sufficient 
number of XQueries 4006 have been logged, the index server 4030 is activated and the 
logged XQueries 4020 are analysed to build an index table 4035. This process involves 
constructing a schema graph representation of the data stored in the database 4015 and its 
co-existing views represented by the logged XQueries 4020, building various frequency tables 
associated with these views, identifying searchable keywords in the database, determining 
one or more context trees and constructing a corresponding XQuery for each context tree, 
and finally recording these keywords and XQueries in an index table 4035 for later quick 
retrieval. 

Once the building process of the index table 4035 completes, the system 4000 is 
ready to perform keyword search operations invoked at the keyword search client 4025. 
Search keywords 4026 entered by the user are transmitted to the index server 4030 where 
they are looked up against the index table 4035 and one or more XQueries 4031 are 
retrieved and presented to the user, appropriately ranked according to their relevance to the 
search keywords. When the user selects an XQuery 4027 from the list, the XQuery 4027 
is transmitted by the keyword search client 4025 to the data server 4010 which responds 
with the appropriate data 4011. 

The method 2000 of keyword searching involving one or more hierarchical data 
sources is summarised by the flowchart in Fig. 2. The method 2000 is preferably executed 
on the computer of the index server 4030. The method 2000 begins at step 2005, where 
hit nodes are identified in the schema graph. In an XML environment, there are 
potentially two ways in which a hit node can arise in the schema graph: (i) its element 
name may contain one of the search keywords or (ii) one or more XML nodes it represents 
may contain one of the search keywords. Subsequent to step 2005, step 2010 identifies 
context trees in the schema graph, each comprising nodes in the data sources represented 
by the hit and context nodes in the schema graph. Finally at step 2015, the identified 
context trees are converted to XQueries and presented to the user as a ranked list. 
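
The two ways in which a hit node can arise at step 2005 may be sketched as follows. This is an illustration only: the dictionary layout (schema node name mapped to the text values of the XML nodes it represents) and the use of case-insensitive substring matching are assumptions, not details given in this specification.

```python
def find_hit_nodes(schema_nodes, keyword):
    """Return the schema nodes that qualify as hit nodes for one keyword.

    `schema_nodes` maps a schema node name to the list of text values of the
    XML nodes it represents (a hypothetical layout chosen for this sketch).
    """
    kw = keyword.lower()
    hits = []
    for name, values in schema_nodes.items():
        name_matches = kw in name.lower()                      # case (i)
        value_matches = any(kw in v.lower() for v in values)   # case (ii)
        if name_matches or value_matches:
            hits.append(name)
    return hits
```

A keyword may therefore produce hit nodes both through element names and through stored data, and the resulting list can contain nodes from more than one data source.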

Methods for identifying context trees denoted by step 2010 in Fig. 2 are now 
described in detail. A method is first presented for the special case where there is a single 
hit node in the schema graph, followed later by a more general method that can handle 
cases involving more than one hit node. Both methods operate in two phases. The first is 
a bottom-up traversal of the schema graph from the hit nodes to determine which of their 
parents and ancestors are context nodes, from which the second phase proceeds in a 
top-down fashion to determine which of their descendants are also context nodes. The 
top-most ancestor of the hit nodes determined to be a context node then represents the root 
node of the context tree presented to the user as a result of the keyword search 
operation. 

For the purpose of determining whether a node in the schema graph is a context 
node, preferably at least an occurrence frequency table and a co-occurrence frequency 
table are maintained. The former records the frequencies at which each node in the 
schema graph occurred in a logged query or data view whilst the latter records the 
frequencies at which pairs of nodes in the schema graph co-occur in the same logged query 
or data view. When the schema graph is updated with a new query or data view 
containing new nodes, new entries are added to the occurrence frequency table to represent 
the new nodes, and are each given an initial frequency value of 1 indicating that the nodes 
are new and have not previously been observed. Likewise, for each node-pair from the 
new query comprising two new nodes or a new node and an existing node, a new entry is 
added to the co-occurrence frequency table and given an initial frequency value of 1, 
whilst for each node-pair comprising a new node and an existing node not present in the 
new query, a new entry is added to the co-occurrence frequency table but is given an initial 
frequency value of 0. 
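
The maintenance of the two tables can be sketched as below. The class and method names are hypothetical. The sketch assumes that nodes and node-pairs already present in the tables are incremented by 1 each time they recur in a logged query, which the text implies but does not state, and it represents the entries with frequency 0 implicitly, since a missing counter entry reads as 0.

```python
from collections import Counter
from itertools import combinations


class FrequencyTables:
    """Occurrence and co-occurrence frequency tables built from logged queries."""

    def __init__(self):
        self.occ = Counter()  # node -> number of logged queries containing it
        self.co = Counter()   # frozenset({a, b}) -> number of queries containing both

    def log_query(self, nodes):
        """Record one logged query or data view, given the schema nodes it contains."""
        nodes = set(nodes)
        for node in nodes:
            self.occ[node] += 1          # a new node starts at 1
        for a, b in combinations(sorted(nodes), 2):
            self.co[frozenset((a, b))] += 1  # a new pair starts at 1

    def freq(self, a, b=None):
        """freq(a) if b is omitted, otherwise the co-occurrence frequency freq(a, b)."""
        if b is None:
            return self.occ[a]
        return self.co[frozenset((a, b))]
```

Pairs are stored unordered here because co-occurrence is symmetric. An implementation that stores only ancestor and descendant pairs would be smaller, but that refinement is not shown.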

As the schema graph is traversed, an occurrence probability is computed for each 
node, given the occurrences of the hit nodes. These conditional probability values are 
computed or approximated from values stored in the frequency tables due to previously 
logged queries and data views, and are used to determine whether a node is a context node. 

The following is a description of the first method for the special case where there is 
a single hit node. Let this hit node be denoted by X. 

In the first phase, a bottom-up traversal through the schema graph is made 
beginning at node X. Each of X's ancestors Y_i is considered in turn and its occurrence 
probability, given the occurrence of X, is computed: 

$$\Pr[Y_i \mid X] = \frac{\mathrm{freq}(Y_i, X)}{\mathrm{freq}(X)} \qquad \text{Eq. 1}$$

where the probability value has been approximated by an occurrence frequency freq(X) 
and a co-occurrence frequency freq(Y_i, X). The latter denotes the frequency that X and Y_i 
co-occur, where Y_i is an ancestor of X. Both are obtained directly from the occurrence and 
co-occurrence frequency tables stated earlier. From these probability values computed for 
the ancestor nodes of X it is possible to determine the probability that a particular ancestor 
Y_i is a root node, given X. Let Z_1, ..., Z_n denote the parent nodes of Y_i, then 

$$\Pr[Y_i\ \text{root} \mid X] = \Pr[\neg Z_1 \wedge \cdots \wedge \neg Z_n \wedge Y_i \mid X] \qquad \text{Eq. 2}$$

That is, the probability that Y_i is root, given X, is the probability that Y_i is present 
and none of its parents are present, given X. Expanding the right hand side of Eq. 2 gives: 

$$\begin{aligned}
\Pr[Y_i\ \text{root} \mid X] &= \Pr[\neg Z_1 \wedge \cdots \wedge \neg Z_n \mid Y_i \wedge X]\,\Pr[Y_i \mid X] \\
&= \bigl(1 - \Pr[Z_1 \vee \cdots \vee Z_n \mid Y_i \wedge X]\bigr)\,\Pr[Y_i \mid X]
\end{aligned} \qquad \text{Eq. 3}$$

Since Z_1, ..., Z_n are mutually exclusive given Y_i (Y_i can have at most one parent in 
any actual hierarchical data structure), 

$$\begin{aligned}
\Pr[Y_i\ \text{root} \mid X] &= \Bigl(1 - \sum_{j=1}^{n} \Pr[Z_j \mid Y_i \wedge X]\Bigr)\,\Pr[Y_i \mid X] \\
&= \Pr[Y_i \mid X] - \sum_{j=1}^{n} \Pr[Z_j \mid Y_i \wedge X]\,\Pr[Y_i \mid X] \\
&= \Pr[Y_i \mid X] - \sum_{j=1}^{n} \Pr[Z_j \wedge Y_i \mid X]
\end{aligned} \qquad \text{Eq. 4}$$

But since there is at most one directed path between Z_j and X (a characteristic of 
the schema graph), it follows that this path must include Y_i, and hence: 

$$\Pr[Z_j \wedge Y_i \wedge X] = \Pr[Z_j \wedge X] \qquad \text{Eq. 5}$$

$$\Rightarrow\ \Pr[Z_j \wedge Y_i \mid X] = \Pr[Z_j \mid X] \qquad \text{Eq. 6}$$

$$\Rightarrow\ \Pr[Y_i\ \text{root} \mid X] = \Pr[Y_i \mid X] - \sum_{j=1}^{n} \Pr[Z_j \mid X] \qquad \text{Eq. 7}$$

In a preferred implementation, a number of alternative context trees are returned to 
the user as results of a keyword search operation, one for each ancestor node Y_i of X 
whose associated probability Pr[Y_i root | X] is greater than zero. These alternative 
context trees are each assigned a score being the associated probability Pr[Y_i root | X] and 
sorted according to these scores, from highest to lowest. Context trees with higher scores 
are considered to be of more interest to the user than those with lower scores. In an 
alternative implementation, only the context tree with the highest score is presented to the 
user as the result of the keyword search operation. 
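
The scoring and ranking of candidate root nodes can be sketched as follows, using the frequency approximation of Eq. 1 inside Eq. 7. The function name, the table layout (co-occurrence frequencies keyed by unordered pairs) and the example figures are assumptions made for illustration.

```python
def rank_roots(hit, ancestors, parents, occ, co):
    """Score each ancestor Y_i of the hit node by Pr[Y_i root | X] and rank them.

    hit       -- the hit node X
    ancestors -- the ancestors Y_i of X in the schema graph
    parents   -- maps a node to the list of its parent nodes Z_j
    occ       -- occurrence frequency table: node -> freq(node)
    co        -- co-occurrence table: frozenset({a, b}) -> freq(a, b)
    """
    def pr_given_hit(node):
        # Eq. 1: Pr[node | X] approximated by freq(node, X) / freq(X)
        return co.get(frozenset((node, hit)), 0) / occ[hit]

    scores = {}
    for y in ancestors:
        # Eq. 7: Pr[Y_i root | X] = Pr[Y_i | X] - sum_j Pr[Z_j | X]
        score = pr_given_hit(y) - sum(pr_given_hit(z) for z in parents.get(y, ()))
        if score > 0:
            scores[y] = score
    # Highest score first; each entry is a candidate context tree root.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For instance, if a hypothetical hit node "Name" occurred in 10 logged queries, always together with its parent "Employee" and 4 times together with "Company" (the parent of "Employee"), then "Employee" scores 1.0 - 0.4 = 0.6 and "Company" scores 0.4, so "Employee" is ranked first.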

For each ancestor node Y_i that can serve as the root node of a context tree (i.e. 
whose Pr[Y_i root | X] > 0), a second phase, top-down traversal from Y_i is performed to 
determine which of its descendants (except the hit node X) are context nodes. For each 
parent node P_j visited during this phase, an analysis is performed to determine which of its 
children are context nodes. For each child node determined to be a context node, its 
children are in turn analysed in a top-down fashion to identify context nodes among them. 

There are two distinct scenarios in the analysis of a parent node P_j, as illustrated in 
Figs. 3A and 3B. The first is a special case shown in Fig. 3A where P_j lies along the path 
from the root node Y_i 3005 to the hit node X 3020. This includes the case P_j = Y_i but 
excludes the case P_j = X. In this scenario, at least one child node of P_j 3010, in this case 
the child node C_1 3015 that lies along the path from P_j to X, must be identified as a 
context node. In the more general second scenario, encompassing all remaining cases as 
shown in Fig. 3B, the parent node P_j 3030 does not lie along a directed path from the root 
node Y_i to the hit node X 3035 and thus it is not compulsory to identify any child nodes of 
P_j as context nodes. An algorithm for handling the second scenario will be presented first. 

For a given hit node X and a specific node Y_i to serve as the root node of a context 
tree, the choice of whether some child node C_k of a parent node P_j is to be identified as a 
context node is in general a function of the probability that C_k occurs given the presence of 
all nodes along the directed path from Y_i to X: 

$$\Pr[C_k \mid X \wedge \cdots \wedge Y_i\ \text{root} \wedge \cdots \wedge P_j]$$

Since the evaluation or estimation of this probability is not possible with just the 
occurrence and co-occurrence frequency tables mentioned earlier, some form of 
simplification or approximation is needed. One such simplification preferably adopted is 
to ignore the effects of all nodes other than those from Y_i to C_k in the above probability 
expression, resulting in the expression: 

$$\Pr[C_k \mid Y_i\ \text{root} \wedge \cdots \wedge P_j]$$

where 

$$\begin{aligned}
\Pr[C_k \mid Y_i\ \text{root} \wedge \cdots \wedge P_j] &= \Pr[C_k \mid Y_i\ \text{root} \wedge P_j] \\
&= \frac{\Pr[C_k \wedge Y_i\ \text{root} \wedge P_j]}{\Pr[Y_i\ \text{root} \wedge P_j]} \\
&= \frac{\Pr[C_k \wedge Y_i\ \text{root}]}{\Pr[Y_i\ \text{root} \wedge P_j]}
\end{aligned} \qquad \text{Eq. 8}$$

Let Z_l, l = 1, ..., p, denote the parent nodes of Y_i, then the right hand side of Eq. 8 
can be expanded to 

$$\frac{\Pr[C_k \wedge Y_i] - \sum_{l=1}^{p} \Pr[C_k \wedge Z_l]}{\Pr[P_j \wedge Y_i] - \sum_{l=1}^{p} \Pr[P_j \wedge Z_l]} = \frac{\mathrm{freq}(Y_i, C_k) - \sum_{l=1}^{p} \mathrm{freq}(Z_l, C_k)}{\mathrm{freq}(Y_i, P_j) - \sum_{l=1}^{p} \mathrm{freq}(Z_l, P_j)} \qquad \text{Eq. 9}$$

The above expression, however, only deals with each individual child node C_k in 
isolation. Unless the C_k are independent of one another (given Y_i root), it is necessary to 
consider their joint probabilities. This however would require maintaining frequency 
tables storing the joint-occurrence frequencies of a large number of combinations of nodes, 
many of which would rarely be observed and hence it would not be possible to reliably 
estimate their joint probabilities from their joint-occurrence frequencies. On the other 
hand, assuming independence among the C_k (given Y_i root) may lead to undesirable results, 
such as none of the child nodes C_k being selected if their individual occurrence 
probabilities (given Y_i root) are low. 
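
The per-child probability of Eq. 9 needs only the pairwise tables, as the following sketch shows. The function name and table layout are assumptions. The sketch also assumes that when the parent P_j is the root Y_i itself, the term freq(Y_i, P_j) is read from the occurrence table, a case the text does not spell out.

```python
def child_probability(child, parent, root, root_parents, occ, co):
    """Estimate Pr[C_k | Y_i root and P_j] from pairwise frequencies (Eq. 9).

    child, parent, root -- the nodes C_k, P_j and Y_i
    root_parents        -- the parent nodes Z_l of Y_i in the schema graph
    occ                 -- occurrence table: node -> freq(node)
    co                  -- co-occurrence table: frozenset({a, b}) -> freq(a, b)
    """
    def freq(a, b):
        # Assumption: a node "co-occurs with itself" as often as it occurs.
        if a == b:
            return occ.get(a, 0)
        return co.get(frozenset((a, b)), 0)

    numerator = freq(root, child) - sum(freq(z, child) for z in root_parents)
    denominator = freq(root, parent) - sum(freq(z, parent) for z in root_parents)
    # No observations of P_j under root Y_i: report 0 rather than divide by zero.
    return numerator / denominator if denominator > 0 else 0.0
```

The subtraction of the Z_l terms removes the queries in which Y_i appeared below one of its own parents, leaving only those in which Y_i was the root.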

In order to avoid the undesirable effects of independence assumptions among the C_k, 
whilst at the same time avoiding the need to maintain a large number of joint-occurrence 
frequency values, a heuristic method 5000 depicted by the flowchart in Fig. 5 may be used 
for selecting child nodes as context nodes. The method 5000 preferably operates as a 
sub-program within the method 2000 upon the server 4030. 

The method 5000 begins at step 5005 where the occurrence probability of each 
child node C_k given the root node Y_i, denoted by Q_k = Pr[C_k | Y_i root ∧ P_j], is computed 
using Eq. 8 and Eq. 9. At the next step 5010, the probabilities Q_k are summed over all 
child nodes C_k, the sum being denoted by T. The method 5000 continues at step 5015 
where those nodes C_k with the highest probability value are selected as context nodes. If 
more than one child node exists with the same highest probability value then all such 
nodes are selected as context nodes. The sum S of the probabilities of all child nodes so 
far selected as context nodes is then computed at step 5020. Execution then proceeds to 
the decision step 5025, at which point if all child nodes C_k have been selected as context 
nodes then the method terminates at step 5040. If however there are one or more child 
nodes C_k not yet selected as context nodes then the method 5000 continues to another 
decision step 5030. At step 5030 a check is made to ascertain whether S ≥ T/2 and if so 
the method again terminates at step 5040. If S < T/2 then execution proceeds to step 5035. 
Here the list of child nodes C_k not yet selected as context nodes is examined to 
identify those with the highest probability value among themselves. These are selected as 
context nodes and the method 5000 returns to step 5020 for further processing in the 
manner discussed above. 
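
The selection loop of the method 5000 can be sketched as follows. The function name is hypothetical and the probabilities Q_k are assumed to have been computed already (step 5005). Ties are detected by exact equality, which is an assumption about how "the same highest probability value" would be tested in practice.

```python
def select_context_children(probs):
    """Select context nodes among child nodes (steps 5010 to 5040).

    `probs` maps each child node C_k to its probability Q_k.
    """
    total = sum(probs.values())                      # step 5010: T
    selected = set()
    remaining = dict(probs)
    while remaining:                                 # step 5025: stop when all are selected
        top = max(remaining.values())                # steps 5015 / 5035
        for node in [n for n, q in remaining.items() if q == top]:
            selected.add(node)                       # ties are selected together
            del remaining[node]
        s = sum(probs[n] for n in selected)          # step 5020: S
        if s >= total / 2:                           # step 5030: S >= T/2
            break
    return selected
```

With the illustrative values Q = {A: 0.375, B: 0.25, C: 0.25, D: 0.125}, the first pass selects A (S = 0.375 < T/2 = 0.5), and the second pass selects B and C together because they tie, after which S = 0.875 and the loop ends with D left out.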

The method 5000 has a number of desirable properties: 

• If logged queries or data views intersect with sufficiently high frequencies (i.e. 
with a relatively large number of child nodes in common), then the method 5000 
tends to return their intersections as context nodes. This is likely to lead to an 
acceptable result since an intersection that is sufficiently large tends to carry 
sufficient context information (for the hit node). 

• If logged queries or data views have relatively few child nodes in common, then 
the resulting set of context nodes tends to comprise not only their intersections but 
also additional nodes. Experiments conducted by the present inventor show the 
resulting set of context child nodes tends to reflect that of the most frequent logged 
query or data view. This is significant since the intersections alone would not 
likely contain sufficient context information. 

• Due to the inclusion of child nodes with identically highest probability value as a 
whole, the method 5000 is biased towards identifying more rather than fewer nodes 
as context nodes. In the case where the sets of child nodes present in logged 
queries are mutually exclusive and occur with equal frequencies, the method 
identifies all child nodes as context nodes. 

In the method 5000 when a parent node is identified as a context node, one or more 
of its children are always identified as context nodes as well. This may be undesirable if 
there are many logged queries or data views in which the parent node occurs without any 
of its children (ie. occurs as a leaf node). Intuitively, if this occurs sufficiently often then 
the parent node alone should be identified as a context node without any of its children to 
reflect the frequently observed behaviour. 

To remedy this issue, a preferred implementation makes use of an additional leaf 
co-occurrence frequency table, generated and stored by the index server 4030. This table 
stores the frequency at which a node P_j co-occurs as a leaf node in past logged queries and 
data views with its ancestor Y_i, for every possible pair of such nodes P_j and Y_i, excluding 
those nodes P_j that have no children in the schema graph. This new frequency table is then 
used to estimate the probability that a node P_j occurs as a leaf node, given P_j and some root 
node Y_i: 

$$\begin{aligned}
\Pr[P_j\ \text{leaf} \mid Y_i\ \text{root} \wedge P_j] &= \frac{\Pr[P_j\ \text{leaf} \wedge Y_i\ \text{root}]}{\Pr[P_j \wedge Y_i\ \text{root}]} \\
&= \frac{\mathrm{freq}(Y_i, P_j\ \text{leaf}) - \sum_{l=1}^{p} \mathrm{freq}(Z_l, P_j\ \text{leaf})}{\mathrm{freq}(Y_i, P_j) - \sum_{l=1}^{p} \mathrm{freq}(Z_l, P_j)}
\end{aligned} \qquad \text{Eq. 10}$$

where Z_l, l = 1, ..., p, denote the parent nodes of Y_i as defined earlier, and freq(Y_i, P_j leaf) 
and freq(Z_l, P_j leaf) are co-occurrence frequency values obtained from the new leaf 
co-occurrence frequency table. 

The probability Pr[P_j leaf | Y_i root ∧ P_j] is preferably determined in an additional 
decision step prior to the method 5000 given in Fig. 5 for identifying which child nodes of 
P_j are context nodes. If Pr[P_j leaf | Y_i root ∧ P_j] is greater than 0.5, then no child nodes of P_j 
are selected as context nodes, otherwise the method 5000 is performed to identify which 
child nodes are context nodes. 
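
The leaf probability of Eq. 10 and the additional decision step can be sketched as follows. The sketch follows the stated intent that a parent which frequently occurs as a leaf should be returned without children, so child selection is skipped when the leaf probability exceeds the threshold. The function names, the table layout (both tables keyed by an (ancestor, node) tuple) and the restriction to a parent that differs from the root are assumptions.

```python
def leaf_probability(parent, root, root_parents, co, leaf_co):
    """Estimate Pr[P_j leaf | Y_i root and P_j] as in Eq. 10.

    co      -- (ancestor, node) -> frequency at which the two co-occur
    leaf_co -- (ancestor, node) -> frequency at which node co-occurs as a leaf
    Assumes parent != root for simplicity.
    """
    numerator = leaf_co.get((root, parent), 0) - sum(
        leaf_co.get((z, parent), 0) for z in root_parents)
    denominator = co.get((root, parent), 0) - sum(
        co.get((z, parent), 0) for z in root_parents)
    return numerator / denominator if denominator > 0 else 0.0


def children_after_leaf_check(p_leaf, child_probs, select, threshold=0.5):
    """Skip child selection when the parent is usually a leaf.

    `select` is any routine that picks context nodes from `child_probs`,
    for example an implementation of the method 5000.
    """
    if p_leaf > threshold:
        return set()
    return select(child_probs)
```

Passing the selection routine as a parameter keeps the decision step separate from the heuristic it guards, so either can be changed independently.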

An alternative implementation is also possible, and employs an alternative 
method 6000 whose flowchart is given in Fig. 6 for selecting context nodes among a set of 
child nodes C_k, k = 1, ..., m. The method 6000, which is also performed by the index 
server 4030, begins at step 6001 where a fictitious child node C_0 is conceptually created 
and added to the list of actual child nodes C_1, ..., C_m and is assigned a probability value 
Pr[P_j leaf | Y_i root ∧ P_j] using Eq. 10. At the next step 6005, the actual child nodes C_k 
are assigned their usual probability values Q_k = Pr[C_k | Y_i root ∧ P_j] using Eq. 8 and Eq. 
9. The method 6000 then continues at step 6006 by invoking method 5000 at step 5010 
(skipping step 5005) to select among the child nodes C_0, ..., C_m a set of context nodes. 
When the method 5000 exits, the method 6000 resumes at decision step 6010 where a 
check is made to determine if the fictitious child node C_0 has been selected as a context 
node. If so then execution continues at step 6020 where C_0 is excluded as a context node. 
The method 6000 subsequently terminates at step 6015. If the test at 6010 fails, then the 
method 6000 proceeds directly to the termination step 6015. 

The idea behind the alternative method 6000 for incorporating the possibility that 
none of P_j's child nodes are context nodes is essentially identical to that of the first: 
the children are suppressed when Pr[P_j leaf | Y_i root ∧ P_j] is sufficiently large. However, 
the effects of Pr[P_j leaf | Y_i root ∧ P_j] on the resulting set of context nodes are more gradual in this 
alternative approach, which is generally more favourable than the abrupt on/off behaviour 
of the first approach. 
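
The fictitious-child approach of the method 6000 can be sketched as below. The selection helper restates steps 5010 to 5035 so that the block is self-contained. The function names and the label used for the fictitious node C_0 are hypothetical.

```python
def select_from_step_5010(probs):
    """Steps 5010 to 5035 of the method 5000: pick the highest-probability
    nodes, ties together, until their sum reaches half of the total."""
    total = sum(probs.values())
    selected, remaining = set(), dict(probs)
    while remaining:
        top = max(remaining.values())
        for node in [n for n, q in remaining.items() if q == top]:
            selected.add(node)
            del remaining[node]
        if sum(probs[n] for n in selected) >= total / 2:
            break
    return selected


C0 = "__leaf__"  # hypothetical label standing in for the fictitious child C_0


def select_children_6000(child_probs, p_leaf):
    """Select context nodes among the children of P_j, allowing for none.

    child_probs -- maps each actual child C_k to Q_k
    p_leaf      -- Pr[P_j leaf | Y_i root and P_j], assigned to C_0
    """
    probs = dict(child_probs)
    probs[C0] = p_leaf                      # step 6001
    chosen = select_from_step_5010(probs)   # step 6006
    chosen.discard(C0)                      # steps 6010 and 6020
    return chosen
```

Because C_0 competes with the real children inside the same loop, a high leaf probability can absorb the whole T/2 quota on its own and leave no real child selected, whereas a moderate leaf probability merely reduces how many children are needed to reach the quota.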

For the special scenario where the parent node P_j lies along the directed path from 
the root node Y_i to the hit node X, special considerations must be made to ensure that the 
child node of P_j that lies along the path from P_j to X is identified as a context node. 
Without loss of generality, let this child node be C_1, as illustrated in Fig. 3A as item 3015. 
Whilst the method 5000 presented earlier for the general scenario can be modified (for 
example by inflating the occurrence probability of C_1 above those of all other child nodes 
prior to step 5015), such an approach may not yield correct results. This is because the 
method 5000 as described has been devised to select a set of the most frequently occurring 
child nodes as context nodes given the root Y_i and parent P_j. If this set does not naturally 
contain C_1 then it basically means that C_1 is not related to nodes in the set. Forcefully 
including C_1 would simply result in a set of child nodes that have little in common and 
provide little context for C_1 (and subsequently for X). 

Instead of modifying the method 5000, a different but somewhat procedurally 
similar method 7000 illustrated in Fig. 7 is preferably adopted in another implementation. 
The difference between the new method 7000 and the previous method 5000 lies in the 
independence assumption used. Recall that the first simplification made in 
the general case where P_j does not lie along the directed path from Y_i to X was the 
assumption that 

$$\Pr[C_k \mid X \wedge \cdots \wedge Y_i\ \text{root} \wedge \cdots \wedge P_j]$$

is independent of nodes other than those from Y_i to P_j. In the current scenario where one 
child node C_1 of P_j lies along the path from P_j to X, it would not be sensible to assume that 
C_k is independent of nodes from P_j to X (including C_1) as they are necessary ancestors of 
X that link the hit node X to C_k. Since some simplifications are necessary to keep the 
problem tractable, it follows that a better choice is to assume independence 
between C_k and its ancestors above P_j towards the root node Y_i. 
With this assumption, the probabilities of interest are 

$$\Pr[C_k \mid X \wedge \cdots \wedge C_1 \wedge P_j], \quad k \neq 1$$

Again, since there is at most one directed path linking X and P_j, the above 
expression is equivalent to 

$$\Pr[C_k \mid X \wedge P_j] = \frac{\Pr[C_k \wedge X \wedge P_j]}{\Pr[X \wedge P_j]} \qquad \text{Eq. 11}$$

The numerator on the right hand side of Eq. 11 cannot be obtained from the 
occurrence and co-occurrence frequency tables so far mentioned, since it involves three 
rather than two nodes. An extra joint-occurrence frequency table between 3-tuples of 
nodes is therefore required. Fortunately as each of these 3-tuples comprises a pair of 
parent-child nodes C_k and P_j (rather than any arbitrary pair of nodes), and since each node 
C_k in practice has only a small number of parents, the new joint-occurrence frequency 
table would only be slightly larger than a co-occurrence frequency table involving pairs of 
nodes. 

With the new joint-occurrence frequency table, Pr[C_k | X ∧ P_j] can be estimated 
as 

$$\Pr[C_k \mid X \wedge P_j] = \frac{\mathrm{freq}(C_k, P_j, X)}{\mathrm{freq}(P_j, X)} \qquad \text{Eq. 12}$$

where freq(C_k, P_j, X) denotes the joint-occurrence frequency between nodes C_k, P_j and X, 
P_j is a parent of C_k and an ancestor of X, and C_k is neither X nor an ancestor of X. 

The method 7000 for determining the set of siblings of C_1 to be included with C_1 
as context nodes is very similar to method 5000 already described. The method 7000 
begins at step 7001 where the occurrence probability of each child node C_k ≠ C_1 given the 
parent node P_j and the hit node X, denoted by Q_k = Pr[C_k | X ∧ P_j], is computed using Eq. 
12. At the next step 7005, the probabilities Q_k are summed over all child nodes C_k ≠ C_1, 
the sum being denoted by T. The method 7000 continues at step 7010 where node C_1 is 
selected as a context node, and then subsequently at step 7015 where those nodes C_k ≠ C_1 
with the highest probability value are also selected as context nodes. If more than one 
child node exists with the same highest probability value then all such nodes are selected 
as context nodes. The sum of the probabilities of all child nodes so far selected as context 
nodes excluding C_1 is then computed at step 7020, the sum being denoted by S. Execution 
then proceeds to the decision step 7025, at which point if all child nodes C_k have been 
selected as context nodes then the method 7000 terminates at step 7040. If however there 
are one or more child nodes C_k not yet selected as context nodes then the method 
continues to another decision step 7030. At step 7030 a check is made to ascertain 
whether S ≥ T/2 and, if so, the method 7000 again terminates at step 7040. If S < T/2 then 
execution proceeds to step 7035. Here the list of child nodes C_k ≠ C_1 not yet 
selected as context nodes is examined to identify those with the highest probability value 
among themselves. These are selected as context nodes and the method returns to 
step 7020 for further processing. 
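
The method 7000 can be sketched in the same style. The function names and the table layout (the 3-tuple table keyed by (child, parent, hit) and the pair table keyed by (parent, hit)) are assumptions, and the path child C_1 is assumed not to appear among the sibling probabilities.

```python
def sibling_probability(child, parent, hit, co, joint):
    """Estimate Pr[C_k | X and P_j] as in Eq. 12.

    co    -- (parent, hit) -> freq(P_j, X)
    joint -- (child, parent, hit) -> freq(C_k, P_j, X)
    """
    denominator = co.get((parent, hit), 0)
    return joint.get((child, parent, hit), 0) / denominator if denominator > 0 else 0.0


def select_with_path_child(path_child, sibling_probs):
    """Select C_1 plus the most probable of its siblings (steps 7005 to 7040).

    path_child    -- the child C_1 lying on the path from P_j to X
    sibling_probs -- maps each sibling C_k (k != 1) to Q_k
    """
    total = sum(sibling_probs.values())          # step 7005: T excludes C_1
    selected = {path_child}                      # step 7010
    remaining = dict(sibling_probs)
    while remaining:                             # step 7025
        top = max(remaining.values())            # steps 7015 / 7035
        for node in [n for n, q in remaining.items() if q == top]:
            selected.add(node)
            del remaining[node]
        s = sum(sibling_probs[n] for n in selected if n != path_child)  # step 7020
        if s >= total / 2:                       # step 7030
            break
    return selected
```

As written, the loop always selects at least one sibling whenever any sibling exists, since the first pass runs before the S ≥ T/2 test.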

Some modifications are needed to method 7000 to allow for cases where no 
siblings of C_1 are included in the solution. This is achieved by introducing a sole child 
co-occurrence frequency table that stores the frequency that a node P_j co-occurs with one of 
its descendants X such that only one child node of P_j (C_1 along the path from P_j to X) is 
present in past logged queries and data views. This frequency table is then used to 
estimate the probability that C_1 has no sibling given its parent P_j and the hit node X: 

$$\begin{aligned}
\Pr[C_1\ \text{no sibling} \mid P_j \wedge X] &= \Pr[C_1 \wedge \neg C_k\ \forall k \neq 1 \mid P_j \wedge X] \\
&= \Pr[P_j\ \text{has 1 child} \mid P_j \wedge X] \\
&= \frac{\Pr[P_j\ \text{has 1 child} \wedge P_j \wedge X]}{\Pr[P_j \wedge X]} \\
&= \frac{\Pr[P_j\ \text{has 1 child} \wedge X]}{\Pr[P_j \wedge X]} \\
&= \frac{\mathrm{freq}(P_j\ \text{has 1 child}, X)}{\mathrm{freq}(P_j, X)}
\end{aligned} \qquad \text{Eq. 13}$$

where freq(P_j has 1 child, X) denotes the frequency at which node P_j co-occurs with its 
descendant X and P_j has a single child node (C_1), and is obtained from the new frequency 
table. 

In one implementation, the probability Pr[C_1 no sibling | P_j ∧ X] is used in an 
additional decision step prior to the method 7000 given in Fig. 7 for identifying which 
child nodes of P_j are context nodes. If Pr[C_1 no sibling | P_j ∧ X] is greater than 0.5, then no 
child nodes of P_j other than C_1 are selected as context nodes, otherwise method 7000 is 
performed to identify which child nodes are context nodes. 

An alternative implementation is also possible. This employs an alternative 
method 8000 whose flowchart is given in Fig. 8 for selecting context nodes among a set of 
child nodes C_k, k = 1, ..., m. The method 8000 begins at step 8001 where a fictitious 
child node C_0 is conceptually created and added to the list of actual child nodes C_1, ..., C_m 
and is assigned a probability value Q_0 = Pr[C_1 no sibling | P_j ∧ X] using Eq. 13. At the 
next step 8005, the actual child nodes C_k except C_1 are assigned their usual probability 
values Q_k = Pr[C_k | X ∧ P_j] using Eq. 11. The method 8000 then continues at step 8006 
by invoking method 7000 at step 7005 (skipping step 7001) to select among the child 
nodes C_0, ..., C_m a set of context nodes. When method 7000 exits, method 8000 resumes 
at decision step 8010 where a check is made to determine if the fictitious child node C_0 has 
been selected as a context node. If so then execution continues at step 8020 where C_0 is 
excluded as a context node. The method 8000 subsequently terminates at step 8015. If the 
test at 8010 fails, then the method proceeds directly to the termination step 8015. 

The preceding discussion describes two distinct methods 6000 and 8000 for 
determining from a set of child nodes which are context nodes. Preferably the latter is 
applied in the scenario where the parent node P_j lies along the directed path from the root 
node Y_i to the hit element X, whilst the former is used for all other parent nodes. In an 
alternative implementation, the first method 6000 is employed even for the case where P_j 
lies along the path from Y_i to X. If this results in a set of context child nodes that includes 
C_1, then the set is adopted, otherwise the set is discarded and the second method 8000 is 
applied to determine a new set of context child nodes. The rationale behind this favouring 
of the first method is that the probability values computed there are conditional on the root 
element Y_i, rather than on the hit node X. Tests conducted by the present inventor seem to 
suggest that the root element of a data view tends to be a better indicator of what nodes are 
present in the view. 
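
The alternative implementation, which tries the root-conditioned method first and falls back to the hit-conditioned method, can be sketched as follows. The function and parameter names are hypothetical, and the two methods are passed in as callables so that the sketch does not depend on any particular implementation of methods 6000 and 8000.

```python
from typing import Callable, Hashable, Set


def choose_context_children(
    path_child: Hashable,
    root_conditioned: Callable[[], Set[Hashable]],
    hit_conditioned: Callable[[], Set[Hashable]],
) -> Set[Hashable]:
    """Prefer the result of method 6000 when it already contains C_1.

    path_child       -- the child C_1 on the path from P_j to X
    root_conditioned -- runs method 6000 and returns its set of context children
    hit_conditioned  -- runs method 8000 and returns its set of context children
    """
    candidate = root_conditioned()
    if path_child in candidate:
        return candidate          # the set naturally contains C_1, so adopt it
    return hit_conditioned()      # otherwise discard it and apply method 8000
```

The second callable is only invoked when needed, so the cost of the hit-conditioned computation is avoided whenever the first result is adopted.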

The keyword searching system 4000 disclosed herein is a form of a learning 
system. From a set of logged queries and existing data views, which are akin to training 
examples, the system is able to synthesise new views of data. If patterns exist in the 
logged queries or data views, then they will be reflected in the frequency tables which in 
turn will affect the behaviour of the system 4000. A desirable feature for any learning 
system is an ability to make some form of generalisation that allows it to use patterns 
learned from one set of problems to improve its performance when handling related but 
yet unseen problems. One aspect of generalisation that is important in a hierarchical 
environment is the ability to observe occurrence patterns of certain sub-structures of data 
and generalise them to other similar or identical sub-structures. 

Consider the data structure 9000 shown in Fig. 9, in which there are two identical 
"Employee" sub-structures 9010 and 9030 (enclosed within the dotted curves), one under 
"Manager" 9005 and the other under "Project Members" 9025. Suppose that in all logged 
queries and data views, the sub-elements "FirstName" 9015 and "LastName" 9020 in the 
first "Employee" sub-tree have always been observed to appear together, whilst no queries or 
data views containing the second "Employee" sub-tree 9030 have yet been observed. 
Suppose further that a keyword search operation for an employee's name is invoked in 
which a "hit" is found in the "FirstName" sub-element 9035 of the second "Employee" 
sub-tree 9030, making 9035 the hit node. Even though no example queries or views have 
been encountered with this sub-element present, it is intuitively apparent that from the 
occurrence patterns observed for the first "Employee" sub-tree 9010, the sub-element 
"LastName" 9040 in the second "Employee" sub-tree 9030 should be identified as a 
context node. 

Such a generalisation ability is particularly important when working with XML
data since identical data sub-structures often exist at several locations in a data hierarchy
(for example, as a result of the use of referenced schema elements). Such generalisation may
be realised through probability averaging. Probability averaging works by appropriately
averaging the occurrence probabilities of nodes in the schema graph that have identical
names or IDs or labels. The application of probability averaging is now described firstly for
the first bottom-up phase of the construction of the context tree, and then subsequently for
the second top-down phase.
Recall that the operation of the first phase relies on the probability values
Pr[Y_i | X], where Y_i are ancestors of the hit node X. To facilitate probability averaging,
Pr[Y_i | X] is preferably first reformulated into an incremental form, as follows: Let W be
a child of Y_i that lies along the one and only directed path from Y_i to X. Pr[Y_i | X] can
then be rewritten as

$$\begin{aligned}
\Pr[Y_i \mid X] &= \frac{\Pr[Y_i \wedge X]}{\Pr[X]} \\
&= \frac{\Pr[Y_i \wedge W \wedge X]}{\Pr[X]} \quad \text{(the path from } Y_i \text{ to } X \text{ must include } W\text{)} \\
&= \frac{\Pr[Y_i \mid W \wedge X]\,\Pr[W \wedge X]}{\Pr[X]} \\
&= \Pr[Y_i \mid W \wedge X]\,\Pr[W \mid X]
\end{aligned} \qquad \text{(Eq. 14)}$$

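That is, Pr[Y_i | X] can be incrementally obtained from the probability value of its
child node W, namely Pr[W | X]. The idea is to begin the procedure at the hit node X and
make use of the above expression to obtain probability values for successively higher
ancestor nodes. At each step, the method of probability averaging is then applied to the
first term on the right hand side of Eq. 14. Thus, let Pr'[B | X] denote the modified
probability value of some node B as a result of probability averaging; then Pr'[Y_i | X] can
be defined by the following recursive formulae:

$$\Pr{}'[X \mid X] = 1 \qquad \text{(Eq. 15)}$$

$$\Pr{}'[Y_i \mid X] = \begin{cases} 0 & \text{if } \Pr{}'[W \mid X] = 0 \\ \Pr_{mean}[Y_i \mid W \wedge X]\,\Pr{}'[W \mid X] & \text{otherwise} \end{cases} \qquad \text{(Eq. 16)}$$

where

$$\begin{aligned}
\Pr_{mean}[Y_i \mid W \wedge X] &= \frac{\sum_k \Pr[Y_{ik} \mid W_k \wedge X_k]\,\Pr[W_k \wedge X_k]}{\sum_k \Pr[W_k \wedge X_k]} \\
&= \frac{\sum_k \Pr[Y_{ik} \wedge W_k \wedge X_k]}{\sum_k \Pr[W_k \wedge X_k]} \\
&= \frac{\sum_k \Pr[Y_{ik} \wedge X_k]}{\sum_k \Pr[W_k \wedge X_k]} \quad \text{(the path from } Y_{ik} \text{ to } X_k \text{ must include } W_k\text{)} \\
&= \frac{\sum_k \mathrm{freq}(Y_{ik}, X_k)}{\sum_k \mathrm{freq}(W_k, X_k)}
\end{aligned} \qquad \text{(Eq. 17)}$$
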

and denotes the weighted average or mean probability of Y_i given W and X, computed over
all pairs of nodes (Y_ik, X_k) (for some values of k) that are equivalent to (Y_i, X), with X_0
and Y_i0 (ie. k = 0) being aliases for X and Y_i respectively. For each of these equivalent
pairs (Y_ik, X_k), the term W_k in the summations denotes the immediate child of Y_ik lying
along the directed path from Y_ik to X_k.
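
The frequency form of Eq. 17 can be illustrated with a minimal Python sketch. The table layout (a dictionary keyed by ancestor and descendant node identifiers), the node identifiers and the function name `pr_mean` are assumptions made for illustration only and are not part of the described system.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical frequency table: freq[(ancestor, descendant)] is the number of
# logged queries or data views in which both nodes were observed together.
Freq = Dict[Tuple[str, str], int]


def pr_mean(freq: Freq, triples: List[Tuple[str, str, str]]) -> Optional[float]:
    """Frequency form of Eq. 17.

    `triples` lists (Y_ik, W_k, X_k) for every node pair equivalent to (Y_i, X),
    including k = 0 for (Y_i, X) itself. Returns None when the denominator is
    zero, in which case Eq. 17 is undefined.
    """
    numerator = sum(freq.get((y, x), 0) for y, _, x in triples)
    denominator = sum(freq.get((w, x), 0) for _, w, x in triples)
    return numerator / denominator if denominator else None


# The pair (Y0, X0) has never been observed, but an equivalent pair (Y1, X1) has.
example_freq = {("Y1", "X1"): 6, ("W1", "X1"): 8}
example = pr_mean(example_freq, [("Y0", "W0", "X0"), ("Y1", "W1", "X1")])
print(example)  # 0.75
```

In this sketch the unobserved pair contributes nothing to either sum, so its estimate is borrowed entirely from the equivalent pair, which is the generalisation behaviour motivated by the "Employee" example of Fig. 9.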

A node pair (Y_ik, X_k) is said to be equivalent to a node pair (Y_i, X) if

(i) Y_ik has the same name or label or ID as Y_i and X_k has the same name, label
or ID as X,

(ii) there is a direct ancestor-descendant relationship between Y_ik and X_k and
similarly between Y_i and X,

(iii) for each node W_k along the directed path from Y_ik to X_k, there exists a
corresponding node W along the directed path from Y_i to X such that
(W_k, X_k) is equivalent to (W, X) and (Y_ik, W_k) is equivalent to (Y_i, W), and

(iv) Y_i and Y_ik have exactly the same number of parents and for each parent Z_j of
Y_i, there exists a parent Z_jk of Y_ik such that (Z_jk, Y_ik) and (Z_j, Y_i) satisfy
conditions (i) to (iii) above.

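The modified probability that Y_i is root given X due to probability averaging is
then given by

$$\Pr{}'[Y_i\ \mathit{root} \mid X] = \Pr{}'[Y_i \mid X] - \sum_j \Pr{}'[Z_j \mid X] \qquad \text{(Eq. 18)}$$

where Z_j are the parents of Y_i and

$$\Pr{}'[Z_j \mid X] = \Pr_{mean}[Z_j \mid Y_i \wedge X]\,\Pr{}'[Y_i \mid X] \qquad \text{(Eq. 19)}$$

as obtained from Eq. 16 by replacing Y_i with Z_j and W with Y_i.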

In the event that the denominator on the right hand side of Eq. 17 is zero,
indicating that none of the node pairs (W_k, X_k) has been observed in logged queries and
data views, Eq. 17 and hence Eq. 19 and Eq. 18 are undefined, and consequently some
alternative method for identifying context nodes is needed. A preferred approach is to
define Pr_mean[Z_j | Y_i ∧ X] instead in terms of the distance of Z_j from the hit node X
as follows:

$$\Pr_{mean}[Z_j \mid Y_i \wedge X] = \begin{cases} 1 & \text{if } \sum_k \mathrm{freq}(Y_{ik}, X_k) = 0 \text{ and } \mathrm{dist}(Z_j, X) \le d_{max} \\ 0 & \text{if } \sum_k \mathrm{freq}(Y_{ik}, X_k) = 0 \text{ and } \mathrm{dist}(Z_j, X) > d_{max} \end{cases} \qquad \text{(Eq. 20)}$$

where d_max is some threshold constant, and dist(A, B) is the distance between two nodes A
and B in the schema graph, defined as the number of links along the path between A
and B. In the absence of relevant past logged queries and data views, the distance between
two nodes should give a good indication of how they are related to one another since in
practice related data are usually stored in proximity of each other.
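
The distance-based fallback can be sketched in Python as follows. The adjacency-list representation of the schema graph and the names `dist` and `fallback_probability` are illustrative assumptions; the sketch treats links as undirected when counting them and uses a breadth-first search, which is one possible way of computing the number of links between two nodes.

```python
from collections import deque
from typing import Dict, List, Optional


def dist(adj: Dict[str, List[str]], a: str, b: str) -> Optional[int]:
    """Number of links on a shortest path between a and b, or None if unreachable."""
    seen = {a: 0}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return seen[node]
        for neighbour in adj.get(node, ()):
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    return None


def fallback_probability(adj: Dict[str, List[str]], z: str, x: str, d_max: int) -> float:
    """Distance rule used when no relevant frequencies exist: 1 if within d_max links, else 0."""
    d = dist(adj, z, x)
    return 1.0 if d is not None and d <= d_max else 0.0


# A three-node chain A - B - C (hypothetical node names).
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
```

With `d_max = 2` the call `fallback_probability(chain, "A", "C", 2)` yields 1.0, and with `d_max = 1` it yields 0.0.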

A flowchart of a method 10000 for computing the probability that an ancestor node
Y_i of a hit node X is the root node of a context tree with probability averaging, for all
ancestor nodes Y_i, is shown in Fig. 10. The method 10000 begins at step 10001 with Y_i = X
and hence Pr'[Y_i | X] = 1. At the next step 10005, Eq. 19 and Eq. 20 are used to
compute Pr'[Z_j | X] for each parent node Z_j of Y_i. Subsequent to step 10005, step 10010
computes Pr'[Y_i root | X] according to Eq. 18. Step 10025 then tests
whether all parent nodes of Y_i have been processed. If not, the method 10000
proceeds to step 10015 where a parent node Z_j of Y_i is selected. Upon reaching
step 10020, the method 10000 is recursively invoked at step 10005 (skipping step 10001)
but with the selected parent node Z_j playing the role of Y_i. When this invocation returns,
the current execution of method 10000 resumes and returns to step 10025 to check for
more parent nodes. When all parent nodes have been processed the method 10000 ends at
step 10030.
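
The recursion of method 10000 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the implementation of Fig. 10: the `parents` mapping, the `pr_mean` callback (standing in for Eq. 17 or its distance fallback Eq. 20) and the function name are hypothetical, each ancestor is assumed to be reachable from the hit node by a single upward path, and the root probability is assumed to be the modified probability of a node minus the sum of the modified probabilities of its parents.

```python
from typing import Callable, Dict, List


def root_probabilities(
    hit: str,
    parents: Dict[str, List[str]],
    pr_mean: Callable[[str, str, str], float],
) -> Dict[str, float]:
    """Return Pr'[Y root | X] for the hit node and each of its ancestors.

    pr_mean(z, y, hit) is assumed to return Pr_mean[Z | Y and X].
    """
    pr = {hit: 1.0}  # Eq. 15: the hit node itself has modified probability 1
    root: Dict[str, float] = {}

    def visit(y: str) -> None:
        zs = parents.get(y, [])
        for z in zs:
            # Eq. 16 / Eq. 19: incremental update from child y to parent z
            pr[z] = 0.0 if pr[y] == 0.0 else pr_mean(z, y, hit) * pr[y]
        # Assumed form of the root probability: own mass minus mass passed upwards
        root[y] = pr[y] - sum(pr[z] for z in zs)
        for z in zs:
            visit(z)  # recursive invocation with z playing the role of y

    visit(hit)
    return root


# Hypothetical chain X -> E -> M with assumed mean probabilities.
table = {("E", "X"): 0.8, ("M", "E"): 0.5}
result = root_probabilities("X", {"X": ["E"], "E": ["M"]}, lambda z, y, hit: table[(z, y)])
```

In this example the three root probabilities are approximately 0.2, 0.4 and 0.4 for X, E and M respectively, and they sum to one because each node passes part of its probability mass to its parent.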

Probability averaging is also applied to the second top-down traversal phase. In
this phase, for the general case where a parent node P_j does not lie along a directed path
from the root node Y_i to the hit node X, probability averaging can be applied in the same
way as that used in the first phase. The selection of child nodes C_k of P_j for inclusion in
the keyword search result as context nodes is based on the probabilities

$$\Pr[C_k \mid Y_i\ \mathit{root} \wedge P_j]$$

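With probability averaging, the above expression is replaced by a mean probability

$$\Pr_{mean}[C_k \mid Y_i\ \mathit{root} \wedge P_j] = \frac{\sum_h \Pr[C_{kh} \wedge Y_{ih}\ \mathit{root} \wedge P_{jh}]}{\sum_h \Pr[Y_{ih}\ \mathit{root} \wedge P_{jh}]} \qquad \text{(Eq. 21)}$$

where (Y_ih, C_kh) is equivalent to (Y_i, C_k) and (P_jh, C_kh) is equivalent to (P_j, C_k), with Y_i0,
C_k0 and P_j0 (ie. h = 0) being aliases for Y_i, C_k and P_j respectively. Let Z_l denote the parents
of Y_i, and similarly Z_lh the corresponding parents of Y_ih. The above expression can be
expanded to

$$\begin{aligned}
\Pr_{mean}[C_k \mid Y_i\ \mathit{root} \wedge P_j] &= \frac{\sum_h \left\{ \Pr[C_{kh} \wedge Y_{ih}] - \sum_l \Pr[C_{kh} \wedge Z_{lh}] \right\}}{\sum_h \left\{ \Pr[P_{jh} \wedge Y_{ih}] - \sum_l \Pr[P_{jh} \wedge Z_{lh}] \right\}} \\
&= \frac{\sum_h \left\{ \mathrm{freq}(Y_{ih}, C_{kh}) - \sum_l \mathrm{freq}(Z_{lh}, C_{kh}) \right\}}{\sum_h \left\{ \mathrm{freq}(Y_{ih}, P_{jh}) - \sum_l \mathrm{freq}(Z_{lh}, P_{jh}) \right\}}
\end{aligned} \qquad \text{(Eq. 22)}$$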



For the above expression to be an accurate approximation of the mean probability
Pr_mean[C_k | Y_i root ∧ P_j], the denominator on the right hand side must be sufficiently large
(eg. > some positive constant f_min). When this is not the case, a remedial method
adopted in a preferred implementation is to first approximate Pr_mean[C_k | Y_i root ∧ P_j]
by Pr_mean[C_k | Y_i ∧ P_j], where the probability is conditional on Y_i ∧ P_j rather than
Y_i root ∧ P_j. Thus

$$\begin{aligned}
\Pr_{mean}[C_k \mid Y_i\ \mathit{root} \wedge P_j] &\approx \Pr_{mean}[C_k \mid Y_i \wedge P_j] \\
&= \frac{\sum_h \Pr[C_{kh} \wedge Y_{ih}]}{\sum_h \Pr[P_{jh} \wedge Y_{ih}]} \\
&= \frac{\sum_h \mathrm{freq}(Y_{ih}, C_{kh})}{\sum_h \mathrm{freq}(Y_{ih}, P_{jh})}
\end{aligned} \qquad \text{(Eq. 23)}$$

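If the denominator on the right hand side of Eq. 23 is still not sufficiently large,
then Pr_mean[C_k | Y_i ∧ P_j] is further approximated by a probability conditioned on W rather
than Y_i, where W is the immediate child of Y_i and an ancestor of C_k. That is

$$\begin{aligned}
\Pr_{mean}[C_k \mid Y_i\ \mathit{root} \wedge P_j] &\approx \Pr_{mean}[C_k \mid W \wedge P_j] \\
&= \frac{\sum_h \Pr[C_{kh} \wedge W_h]}{\sum_h \Pr[W_h \wedge P_{jh}]} \\
&= \frac{\sum_h \mathrm{freq}(W_h, C_{kh})}{\sum_h \mathrm{freq}(W_h, P_{jh})}
\end{aligned} \qquad \text{(Eq. 24)}$$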

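The method is repeated further until a sufficiently large value is obtained for the
denominator on the right hand side, or if not, until W denotes a parent of C_k. If the latter,
then Pr_mean[C_k | Y_i root ∧ P_j] is assigned a value based on the distance between C_k and Y_i:

$$\Pr_{mean}[C_k \mid Y_i\ \mathit{root} \wedge P_j] \approx \begin{cases} 1 & \text{if } \mathrm{dist}(C_k, Y_i) \le d_{max} \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Eq. 25)}$$

or the distance between C_k and the hit node X:

$$\Pr_{mean}[C_k \mid Y_i\ \mathit{root} \wedge P_j] \approx \begin{cases} 1 & \text{if } \mathrm{dist}(C_k, X) \le d_{max} \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Eq. 26)}$$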
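
The sequence of progressively coarser approximations can be pictured as a cascade. The Python sketch below assumes that the numerator and denominator sums of each approximation have already been computed from the frequency tables and are supplied in order of preference; the function name and argument layout are illustrative assumptions.

```python
from typing import List, Tuple


def cascade_estimate(
    candidates: List[Tuple[float, float]],
    f_min: float,
    distance: int,
    d_max: int,
) -> float:
    """Return the first ratio whose denominator exceeds f_min, else a distance-based value.

    `candidates` holds (numerator, denominator) pairs ordered from the most
    specific conditioning (root node) to the least specific (the parent of C_k).
    """
    for numerator, denominator in candidates:
        if denominator > f_min:
            return numerator / denominator
    # No approximation had enough supporting observations: fall back on distance.
    return 1.0 if distance <= d_max else 0.0
```

For instance, `cascade_estimate([(1, 2), (30, 40)], f_min=5, distance=3, d_max=2)` skips the first pair because its denominator is too small and returns 0.75 from the second.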

Depending on whether Pr_mean[C_k | Y_i root ∧ P_j] is eventually approximated by Eq.
22, Eq. 23, Eq. 24, Eq. 25 or Eq. 26, the mean probability that a parent node P_j has no
context child nodes given the root node Y_i is computed using Eq. 27, Eq. 28, Eq. 29, Eq.
30 or Eq. 31 respectively:

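$$\begin{aligned}
\Pr_{mean}[P_j\ \mathit{leaf} \mid Y_i\ \mathit{root} \wedge P_j] &= \frac{\sum_h \Pr[P_{jh}\ \mathit{leaf} \wedge Y_{ih}\ \mathit{root}]}{\sum_h \Pr[Y_{ih}\ \mathit{root} \wedge P_{jh}]} \\
&= \frac{\sum_h \left\{ \mathrm{freq}(Y_{ih}, P_{jh}\ \mathit{leaf}) - \sum_l \mathrm{freq}(Z_{lh}, P_{jh}\ \mathit{leaf}) \right\}}{\sum_h \left\{ \mathrm{freq}(Y_{ih}, P_{jh}) - \sum_l \mathrm{freq}(Z_{lh}, P_{jh}) \right\}}
\end{aligned} \qquad \text{(Eq. 27)}$$

$$\begin{aligned}
\Pr_{mean}[P_j\ \mathit{leaf} \mid Y_i\ \mathit{root} \wedge P_j] &\approx \Pr_{mean}[P_j\ \mathit{leaf} \mid Y_i \wedge P_j] \\
&= \frac{\sum_h \Pr[P_{jh}\ \mathit{leaf} \wedge Y_{ih}]}{\sum_h \Pr[P_{jh} \wedge Y_{ih}]} = \frac{\sum_h \mathrm{freq}(Y_{ih}, P_{jh}\ \mathit{leaf})}{\sum_h \mathrm{freq}(Y_{ih}, P_{jh})}
\end{aligned} \qquad \text{(Eq. 28)}$$

$$\Pr_{mean}[P_j\ \mathit{leaf} \mid Y_i\ \mathit{root} \wedge P_j] \approx \Pr_{mean}[P_j\ \mathit{leaf} \mid W \wedge P_j] \qquad \text{(Eq. 29)}$$

$$\Pr_{mean}[P_j\ \mathit{leaf} \mid Y_i\ \mathit{root} \wedge P_j] \approx \begin{cases} 0 & \text{if } \mathrm{dist}(P_j, Y_i) + 1 \le d_{max} \\ 1 & \text{otherwise} \end{cases} \qquad \text{(Eq. 30)}$$

$$\Pr_{mean}[P_j\ \mathit{leaf} \mid Y_i\ \mathit{root} \wedge P_j] \approx \begin{cases} 0 & \text{if } \mathrm{dist}(P_j, X) + 1 \le d_{max} \\ 1 & \text{otherwise} \end{cases} \qquad \text{(Eq. 31)}$$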



A preferred procedure for determining context child nodes for a parent node P_j
given a root element Y_i, for the general case where P_j does not lie along the directed path
from Y_i to the hit node X, with probability averaging, is similar to that shown in Fig. 6, and
is shown in Fig. 13. The method 13000 begins at step 13001 where a fictitious child node
C_0 is conceptually created and added to the list of actual child nodes C_1, ..., C_m and is
assigned a probability value Q_0 = Pr_mean[P_j leaf | Y_i root ∧ P_j] computed using Eq. 27, Eq.
28, Eq. 29, Eq. 30 or Eq. 31. At the next step 13005, the actual child nodes C_k are
correspondingly assigned probability values Q_k = Pr_mean[C_k | Y_i root ∧ P_j] computed using
Eq. 22, Eq. 23, Eq. 24, Eq. 25 or Eq. 26 respectively. In any case, step 13006 follows
step 13005 and invokes the method 5000 at step 5010 (skipping step 5005) to select
among the child nodes C_0, ..., C_m a set of context nodes. When method 5000 exits, the
method 13000 resumes at decision step 13010 where a check is made to determine if the
fictitious child node C_0 has been selected as a context node. If so then execution continues
at step 13020 where C_0 is excluded as a context node. The method 13000 subsequently
terminates at step 13015. If the test at 13010 fails, then the method proceeds directly to
the termination step 13015.

Apart from their use in keyword searching, the methods 13000 and 6000 can also
be used as means of selective presentation of hierarchical data. As already discussed, a
practical hierarchical data source typically contains much more data than a user may wish
to see at any given time. When a user views a hierarchical data source by selecting a node
within its data structure, a presentation application typically displays all data items in
the sub-tree below the selected node, some of which may often not be of interest to the
user. It would be highly desirable if the presentation application were able to filter out
uninteresting data based on some previously observed viewing patterns of the user. The
methods 13000 and 6000 as described are well suited for this task. By setting Y_i to the root
node selected for viewing by the user, the set of context nodes identified by the methods
constitutes the nodes that are likely to be of interest and are preferably displayed to the user,
whilst the remaining nodes not identified as context nodes are preferably filtered out.
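
The role of the fictitious child node C_0 in method 13000 can be sketched in Python as follows. The selection step of method 5000 is not reproduced here; the `select` callback is a hypothetical stand-in for it, and the label "C0" and the function name are likewise assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, Set


def context_children(
    q_leaf: float,
    q_children: Dict[str, float],
    select: Callable[[Dict[str, float]], Iterable[str]],
) -> Set[str]:
    """Add a fictitious child C0 carrying the 'parent is a leaf' value, select, then drop C0."""
    candidates = {"C0": q_leaf, **q_children}  # steps 13001 and 13005
    chosen = set(select(candidates))           # step 13006 (stand-in for method 5000)
    chosen.discard("C0")                       # steps 13010 and 13020
    return chosen


# Stand-in selector (an assumption): keep every candidate with value >= 0.5.
def threshold_select(candidates: Dict[str, float]) -> Iterable[str]:
    return [name for name, value in candidates.items() if value >= 0.5]


shown = context_children(0.7, {"C1": 0.9, "C2": 0.1}, threshold_select)
```

Here `shown` contains only "C1": the fictitious node competes with the real children during selection but is never itself returned, so a presentation application would display C1 and filter out C2.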

For the special case where a parent node P_j lies on the directed path from the root
node Y_i to the hit node X, recall that the selection of child nodes C_k of P_j for inclusion as
context nodes is based on the probabilities

$$\Pr[C_k \mid X \wedge P_j] \qquad \text{(Eq. 32)}$$

With probability averaging these are replaced by a mean probability:

$$\Pr_{mean}[C_k \mid X \wedge P_j] = \frac{\sum_h \Pr[C_{kh} \wedge X_h \wedge P_{jh}]}{\sum_h \Pr[X_h \wedge P_{jh}]} \qquad \text{(Eq. 33)}$$

where (P_jh, C_kh) is equivalent to (P_j, C_k) and (P_jh, X_h) is equivalent to (P_j, X), with P_j0, C_k0
and X_0 (ie. h = 0) being aliases for P_j, C_k and X respectively.

For the above expression to be an accurate approximation of the mean probability
Pr_mean[C_k | X ∧ P_j], the denominator on the right hand side of Eq. 33 must be sufficiently
large (eg. > f_min). When this is not the case, another remedial method that may be used is
to approximate Pr_mean[C_k | X ∧ P_j] by Pr_mean[C_k | X' ∧ P_j], a probability conditioned on
X' rather than X, where X' is the immediate parent of X lying on the directed path from Y_i
to X.

A flowchart of a method 22000 for identifying a node X' used for determining an
approximation for Pr_mean[C_k | X ∧ P_j] is shown in Fig. 22. The method 22000 begins at
step 22005 where X' is first initialised to X. At the next step 22010 the sum
Σ_h freq(P_jh, X'_h) is computed and assigned to D, where the node pairs (P_jh, X'_h) are
equivalent to (P_j, X'). Decision step 22015 then follows and tests if D is greater than or
equal to some positive threshold constant f_min. If so, the method 22000 exits with success
at step 22025. If the decision step 22015 fails then execution proceeds to another decision
step 22030, where a test is made to determine whether X' is an immediate child of P_j. If
so then the method exits with failure at step 22035, otherwise it continues at step 22040
where X' is replaced by its parent lying along the directed path from P_j to X. The
method 22000 then loops back to step 22010.
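
The loop of method 22000 can be sketched in Python as below. The sketch assumes that the nodes on the directed path are supplied as a list ordered from X up to the immediate child of P_j, and that a hypothetical `equivalents` callback returns the node pairs (P_jh, X'_h) equivalent to (P_j, X'); neither name comes from the specification.

```python
from typing import Callable, Dict, List, Optional, Tuple

Pair = Tuple[str, str]


def find_x_prime(
    path_up: List[str],
    equivalents: Callable[[str], List[Pair]],
    freq: Dict[Pair, int],
    f_min: int,
) -> Optional[Tuple[str, int]]:
    """Walk from X towards P_j until the summed frequency D reaches f_min.

    Returns (X', D) on success, or None when even the immediate child of P_j
    has too few supporting observations (the failure exit).
    """
    for x_prime in path_up:                      # steps 22005 and 22040
        d = sum(freq.get(pair, 0) for pair in equivalents(x_prime))  # step 22010
        if d >= f_min:                           # step 22015
            return x_prime, d                    # step 22025: success
    return None                                  # step 22035: failure


# Hypothetical data: X is rarely seen with P, its parent E is seen more often.
pairs_freq = {("P", "X"): 1, ("P", "E"): 3, ("P2", "E'"): 4}
lookup = lambda node: [("P", node), ("P2", node + "'")]
found = find_x_prime(["X", "E", "C"], lookup, pairs_freq, f_min=5)
```

In this example `found` is `("E", 7)`: the hit node X alone has D = 1, which is below the threshold, whereas its parent E reaches D = 7 once the equivalent pair is included.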

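If method 22000 succeeds with a node X' and a corresponding value D, then
Pr_mean[C_k | X ∧ P_j] is assigned the value

$$\Pr_{mean}[C_k \mid X \wedge P_j] \approx \frac{\sum_h \mathrm{freq}(C_{kh}, P_{jh}, X'_h)}{D} \qquad \text{(Eq. 34)}$$

In the event that method 22000 exits with failure, Pr_mean[C_k | X ∧ P_j] is assigned a
value based on the distance between C_k and Y_i

$$\Pr_{mean}[C_k \mid X \wedge P_j] \approx \begin{cases} 1 & \text{if } \mathrm{dist}(C_k, Y_i) \le d_{max} \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Eq. 35)}$$

or the distance between C_k and the hit node X

$$\Pr_{mean}[C_k \mid X \wedge P_j] \approx \begin{cases} 1 & \text{if } \mathrm{dist}(C_k, X) \le d_{max} \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Eq. 36)}$$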

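Depending on whether Pr_mean[C_k | X ∧ P_j] is eventually approximated by Eq. 34,
Eq. 35 or Eq. 36, the mean probability that a parent node P_j has no context child nodes
other than the child node C_1 lying on the directed path from P_j to X, given P_j and the hit
node X, is computed using Eq. 37, Eq. 38, or Eq. 39 respectively:

$$\Pr_{mean}[C_1\ \mathit{nosibling} \mid P_j \wedge X] \approx \frac{\sum_h \mathrm{freq}(P_{jh}\ \text{1 child}, X'_h)}{D} \qquad \text{(Eq. 37)}$$

$$\Pr_{mean}[C_1\ \mathit{nosibling} \mid P_j \wedge X] \approx \begin{cases} 0 & \text{if } \mathrm{dist}(P_j, Y_i) + 1 \le d_{max} \\ 1 & \text{otherwise} \end{cases} \qquad \text{(Eq. 38)}$$

$$\Pr_{mean}[C_1\ \mathit{nosibling} \mid P_j \wedge X] \approx \begin{cases} 0 & \text{if } \mathrm{dist}(P_j, X) + 1 \le d_{max} \\ 1 & \text{otherwise} \end{cases} \qquad \text{(Eq. 39)}$$

where (P_jh, X'_h) are equivalent to (P_j, X'), and X' and D are obtained by method 22000.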
A preferred procedure for determining context child nodes for a parent node P_j for
the special case where P_j lies along the directed path from Y_i to the hit node X with
probability averaging is very similar to that shown in Fig. 8, and is shown in Fig. 14.
A method 14000 shown in Fig. 14 begins at step 14001 where a fictitious child
node C_0 is conceptually created and added to the list of actual child nodes C_1, ..., C_m and
is assigned a probability value Q_0 = Pr_mean[C_1 nosibling | P_j ∧ X] computed using Eq. 37,
Eq. 38 or Eq. 39. At the next step 14005, the actual child nodes C_k except C_1 are
correspondingly assigned probability values Q_k = Pr_mean[C_k | X ∧ P_j] computed using Eq.
34, Eq. 35 or Eq. 36 respectively. In any case, step 14006 follows step 14005 and invokes
method 7000 at step 7005 (skipping step 7001) to select among the child nodes C_0, ..., C_m
a set of context nodes. When the method 7000 exits, the method 14000 resumes at
decision step 14010 where a check is made to determine if the fictitious child node C_0 has
been selected as a context node. If so then execution continues at step 14020 where C_0 is
excluded as a context node. The method 14000 subsequently terminates at step 14015. If
the test at 14010 fails, then the method proceeds directly to the termination step 14015.

The preceding discussion describes methods for identifying context nodes in the
special case where there is at most a single hit node in the schema graph. This is the usual
scenario when the user enters only a single search keyword. In the event that the keyword
appears in multiple locations in the schema graph, signifying that there is more than one hit,
each hit is preferably treated separately. That is, the methods as described are applied
for a first hit node in the schema graph and a plurality of context trees are determined for
the hit node. The same methods are then subsequently applied for each of the remaining
hit nodes to obtain a new plurality of context trees, and so on. When all hit nodes have
been processed, the generated context trees may be re-scored if they are found to
encompass multiple hit nodes, and in addition duplicated context trees are removed. The
list of the remaining context trees is then reordered according to their new scores (if any)
and returned to the user as the result of the keyword search operation.

If the user however initiates a 'find all' keyword search operation involving
multiple search keywords combined with a Boolean AND operation, then keyword hits
can potentially appear in two or more hit nodes in the schema graph. A more general
method for determining context trees is now described for handling such a scenario.

Fig. 11 shows an example of a schema graph 11000 within which there are
multiple hit nodes 11010, 11020 and 11025. Let these hit nodes be denoted by X_1, ..., X_n.
Naturally, for a context tree to include all hit nodes, the root node of the smallest sub-tree
containing all hit nodes, denoted by A (node 11005), must be returned as a context node, as
well as all nodes lying along the directed path from A to each hit node. Thus node 11015
must be a context node since it lies along the directed path from A to X_2 (and from A to
X_3).
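
Identifying the node A and the nodes that must be context nodes can be sketched in Python as follows. The sketch assumes a tree in which each node has a single parent, represented by a hypothetical `parent` mapping; the parent links used in the example are assumed for illustration and are only loosely modelled on Fig. 11.

```python
from typing import Dict, List, Set, Tuple


def ancestors_inclusive(parent: Dict[str, str], node: str) -> List[str]:
    """Return the node followed by its ancestors, nearest first."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def smallest_subtree_root(parent: Dict[str, str], hits: List[str]) -> str:
    """Deepest node that is an ancestor of (or equal to) every hit node."""
    chains = [ancestors_inclusive(parent, h) for h in hits]
    common = set(chains[0]).intersection(*(set(c) for c in chains[1:]))
    # Walking upwards from the first hit, the first common node is the deepest one.
    return next(n for n in chains[0] if n in common)


def forced_context_nodes(parent: Dict[str, str], hits: List[str]) -> Tuple[str, Set[str]]:
    """Return A and every node on a directed path from A to a hit node (inclusive)."""
    a = smallest_subtree_root(parent, hits)
    forced = {a}
    for h in hits:
        node = h
        while node != a:
            forced.add(node)
            node = parent[node]
    return a, forced


# Assumed layout: 11005 has children 11010 and 11015; 11015 has children 11020 and 11025.
links = {"11005": "root", "11010": "11005", "11015": "11005", "11020": "11015", "11025": "11015"}
a_node, forced = forced_context_nodes(links, ["11010", "11020", "11025"])
```

Under the assumed layout, `a_node` is "11005" and `forced` contains the three hit nodes together with 11005 and 11015.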

The first, bottom-up phase of the context tree determination method begins at node
A and traverses upwards. Let Y_i be A or an ancestor of A, whose probability given the hit
nodes, Pr[Y_i | X_1 ∧ ... ∧ X_n], needs to be evaluated in order to determine the possible root
node of a context tree. Expressed mathematically

$$\Pr[Y_i \mid X_1 \wedge \cdots \wedge X_n] = \frac{\Pr[Y_i \wedge X_1 \wedge \cdots \wedge X_n]}{\Pr[X_1 \wedge \cdots \wedge X_n]} \qquad \text{(Eq. 40)}$$

At this point some independence assumptions are necessary since
neither the numerator nor the denominator on the right hand side of Eq. 40 can be obtained or
estimated directly from the existing frequency tables for a general value of n (except for
the denominator when n ≤ 2). A plausible assumption is that the X_l are independent
of one another given a common ancestor Y_i. In other words:

$$\Pr[X_1 \wedge \cdots \wedge X_n \mid Y_i] = \Pr[X_1 \mid Y_i] \cdots \Pr[X_n \mid Y_i] \qquad \text{(Eq. 41)}$$

Thus

$$\begin{aligned}
\Pr[Y_i \wedge X_1 \wedge \cdots \wedge X_n] &= \Pr[X_1 \wedge \cdots \wedge X_n \mid Y_i]\,\Pr[Y_i] \\
&= \Pr[X_1 \mid Y_i] \cdots \Pr[X_n \mid Y_i]\,\Pr[Y_i] \\
&= \frac{\Pr[X_1 \wedge Y_i] \cdots \Pr[X_n \wedge Y_i]}{\Pr[Y_i]^{\,n-1}}
\end{aligned} \qquad \text{(Eq. 42)}$$

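In order to remove the singularity when Pr[Y_i] = 0, Pr[Y_i ∧ X_1 ∧ ... ∧ X_n] is redefined as

$$\Pr[Y_i \wedge X_1 \wedge \cdots \wedge X_n] \approx \begin{cases} 0 & \text{if } \Pr[Y_i] = 0 \\ \dfrac{\Pr[X_1 \wedge Y_i] \cdots \Pr[X_n \wedge Y_i]}{\Pr[Y_i]^{\,n-1}} & \text{otherwise} \end{cases} \qquad \text{(Eq. 43)}$$

Similarly

$$\begin{aligned}
\Pr[X_1 \wedge \cdots \wedge X_n] &= \Pr[A \wedge X_1 \wedge \cdots \wedge X_n] \\
&= \Pr[X_1 \wedge \cdots \wedge X_n \mid A]\,\Pr[A] \\
&= \Pr[X_1 \mid A] \cdots \Pr[X_n \mid A]\,\Pr[A] \\
&= \frac{\Pr[X_1 \wedge A] \cdots \Pr[X_n \wedge A]}{\Pr[A]^{\,n-1}}
\end{aligned} \qquad \text{(Eq. 44)}$$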
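
The approximation of Eq. 43 (and, with A in place of Y_i, Eq. 44) can be checked numerically with a short Python sketch. The function name and the example probabilities are assumptions chosen for illustration.

```python
from math import prod
from typing import Sequence


def joint_estimate(pr_y: float, pr_pairs: Sequence[float]) -> float:
    """Estimate Pr[Y and X_1 and ... and X_n] from pairwise values Pr[X_l and Y].

    Uses the conditional-independence assumption: the product of the pairwise
    probabilities divided by Pr[Y]^(n-1), with 0 returned when Pr[Y] = 0.
    """
    if pr_y == 0:
        return 0.0
    return prod(pr_pairs) / pr_y ** (len(pr_pairs) - 1)


# Pr[Y] = 0.5, Pr[X_1 and Y] = 0.25, Pr[X_2 and Y] = 0.2 (hypothetical values).
estimate = joint_estimate(0.5, [0.25, 0.2])
```

With these values Pr[X_1 | Y] = 0.5 and Pr[X_2 | Y] = 0.4, so the independence assumption gives 0.5 × 0.4 × 0.5 = 0.1, which is the value the function returns up to floating-point rounding.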

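As in the case where there is only a single hit node, the occurrence probability of
Y_i given all hit nodes is preferably expressed incrementally in terms of the probability of
its immediate child node to facilitate probability averaging, as follows: Let W denote the
immediate child node of Y_i along the directed path from Y_i to A, then

$$\Pr{}'[Y_i \mid X_1 \wedge \cdots \wedge X_n] = \begin{cases} 0 & \text{if } \Pr{}'[W \mid X_1 \wedge \cdots \wedge X_n] = 0 \\ \Pr_{mean}[Y_i \mid W \wedge X_1 \wedge \cdots \wedge X_n]\,\Pr{}'[W \mid X_1 \wedge \cdots \wedge X_n] & \text{otherwise} \end{cases} \qquad \text{(Eq. 45)}$$

where

$$\Pr_{mean}[Y_i \mid W \wedge X_1 \wedge \cdots \wedge X_n] = \frac{\sum_h \Pr[Y_{ih} \wedge X_{1h} \wedge \cdots \wedge X_{nh}]}{\sum_h \Pr[W_h \wedge X_{1h} \wedge \cdots \wedge X_{nh}]} \qquad \text{(Eq. 46)}$$

where the pairs (Y_ih, W_h) are equivalent to (Y_i, W), and (Y_ih, X_lh) are equivalent to (Y_i, X_l)
for l = 1, ..., n, and Y_i0, W_0 and X_l0 (h = 0) are aliases for Y_i, W and X_l respectively. The
term inside the summation on the numerator can be substituted by Eq. 43. The term inside
the summation of the denominator can also be substituted by Eq. 43 by letting W play the
role of Y_i, thus resulting in

$$\Pr_{mean}[Y_i \mid W \wedge X_1 \wedge \cdots \wedge X_n] = \frac{\sum_h a_h}{\sum_h b_h} \qquad \text{(Eq. 47)}$$

where, up to a common normalising constant that cancels in Eq. 47,

$$a_h = \begin{cases} 0 & \text{if } \Pr[Y_{ih}] = 0 \\ \dfrac{\Pr[Y_{ih} \wedge X_{1h}] \cdots \Pr[Y_{ih} \wedge X_{nh}]}{\Pr[Y_{ih}]^{\,n-1}} & \text{otherwise} \end{cases} \;\propto\; \begin{cases} 0 & \text{if } \mathrm{freq}(Y_{ih}) = 0 \\ \dfrac{\mathrm{freq}(Y_{ih}, X_{1h}) \cdots \mathrm{freq}(Y_{ih}, X_{nh})}{\mathrm{freq}(Y_{ih})^{\,n-1}} & \text{otherwise} \end{cases} \qquad \text{(Eq. 48)}$$

$$b_h = \begin{cases} 0 & \text{if } \Pr[W_h] = 0 \\ \dfrac{\Pr[W_h \wedge X_{1h}] \cdots \Pr[W_h \wedge X_{nh}]}{\Pr[W_h]^{\,n-1}} & \text{otherwise} \end{cases} \;\propto\; \begin{cases} 0 & \text{if } \mathrm{freq}(W_h) = 0 \\ \dfrac{\mathrm{freq}(W_h, X_{1h}) \cdots \mathrm{freq}(W_h, X_{nh})}{\mathrm{freq}(W_h)^{\,n-1}} & \text{otherwise} \end{cases} \qquad \text{(Eq. 49)}$$

Pr_mean[Y_i | W ∧ X_1 ∧ ... ∧ X_n] is undefined if Σ_h b_h = 0. When this occurs
Pr_mean[Y_i | W ∧ X_1 ∧ ... ∧ X_n] is preferably assigned a value based on the distances from Y_i
to the hit nodes X_1, ..., X_n as follows

$$\Pr_{mean}[Y_i \mid W \wedge X_1 \wedge \cdots \wedge X_n] = \begin{cases} 1 & \text{if } \min_l \mathrm{dist}(Y_i, X_l) \le d_{max} \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Eq. 50)}$$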
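
The frequency form of the mean probability for multiple hit nodes, with the distance fallback of Eq. 50, can be sketched in Python as follows. The argument layout (one tuple per equivalent node, holding the node frequency and the list of pairwise frequencies with each hit node) and the function names are assumptions made for illustration.

```python
from math import prod
from typing import List, Sequence, Tuple

Term = Tuple[int, Sequence[int]]  # (freq(node), [freq(node, X_1), ..., freq(node, X_n)])


def term_value(freq_node: int, freq_pairs: Sequence[int]) -> float:
    """One summand in frequency form: product of pairwise frequencies over freq(node)^(n-1)."""
    if freq_node == 0:
        return 0.0
    return prod(freq_pairs) / freq_node ** (len(freq_pairs) - 1)


def pr_mean_multi(y_terms: List[Term], w_terms: List[Term], min_dist: int, d_max: int) -> float:
    """Ratio of the summed Y terms to the summed W terms, or a distance rule if undefined."""
    a = sum(term_value(f, pairs) for f, pairs in y_terms)
    b = sum(term_value(f, pairs) for f, pairs in w_terms)
    if b == 0:
        # Denominator is zero: fall back on the distance from Y_i to the nearest hit node.
        return 1.0 if min_dist <= d_max else 0.0
    return a / b


# Hypothetical frequencies for one equivalent pair and two hit nodes.
mean_value = pr_mean_multi([(10, [4, 5])], [(8, [4, 8])], min_dist=3, d_max=2)
```

In this example the Y term is 4 × 5 / 10 = 2 and the W term is 4 × 8 / 8 = 4, so `mean_value` is 0.5.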



A flowchart of a method 12000 for computing the probability that a node Y_i is the
root node of a context tree containing all hit nodes, for all choices of Y_i, is shown in
Fig. 12. The method 12000 begins at step 12001 where the root node of the smallest sub-tree
in the schema graph that contains all hit nodes is identified and denoted as A.
Execution then continues at step 12002 where Y_i is initialised to A and consequently
Pr'[Y_i | X_1 ∧ ... ∧ X_n] = 1. At the next step 12005, Eq. 45 and Eq. 47 together with Eq. 48 and
Eq. 49, or alternatively Eq. 50, are used to compute Pr'[Z_j | X_1 ∧ ... ∧ X_n] for each
parent node Z_j of Y_i. Following step 12005, step 12010 computes
Pr'[Y_i root | X_1 ∧ ... ∧ X_n] according to the equation

$$\Pr{}'[Y_i\ \mathit{root} \mid X_1 \wedge \cdots \wedge X_n] = \Pr{}'[Y_i \mid X_1 \wedge \cdots \wedge X_n] - \sum_j \Pr{}'[Z_j \mid X_1 \wedge \cdots \wedge X_n] \qquad \text{(Eq. 51)}$$

The method 12000 then proceeds to step 12015 where a parent node Z_j of Y_i is
selected. Upon reaching step 12020, the method 12000 is recursively invoked at
step 12005 (skipping steps 12001 and 12002) but with the selected parent node Z_j playing
the role of Y_i. When the recursive invocation returns, execution resumes at decision
step 12025 where a test is made to determine whether all parent nodes of Y_i have been
processed. If so, the method ends at step 12030, otherwise it continues at step 12015
where another parent node Z_j of Y_i is selected for processing.

In the second top-down traversal phase, for the general case where a parent node P_j
does not lie along the directed path from the root node Y_i to any hit node, the method for
determining whether a child node of P_j is a context node remains unchanged from the
method 13000 used previously for the case where there is only a single hit node.

For the special case where the parent node P_j lies along the directed path from Y_i to
one or more hit nodes, it is necessary to modify the method used for the single hit node
case to allow for the possibility that more than one child node of P_j must be included as
context nodes. Recall that for the case involving only one hit node X, the determination of
whether a child node C_k is a context node is based on the probability value

$$\Pr[C_k \mid X \wedge P_j]$$

where X is a descendant of P_j but is not C_k or a descendant of C_k. An extension when there
is more than one hit node is to base the selection process of child nodes C_k, k = 1, ..., m,
on the probabilities of C_k given the parent node P_j and all hit nodes X_l that are descendants
of P_j, whilst ignoring the effects of hit nodes that are not descendants of P_j. Naturally, all
child nodes C_k that are themselves hit nodes or are ancestors of one or more hit nodes must
be context nodes. Without loss of generality, let these child nodes be C_1, ..., C_r, where
1 ≤ r ≤ m. Similarly let the set of hit nodes that are descendants of C_1, ..., C_r be X_1, ..., X_s,
where r ≤ s ≤ n. If s = 1 (and hence r = 1) then this scenario is equivalent to the case
where there is only a single hit node, and hence the method 14000 described for this case
can be used. A method adopted in a preferred implementation for generalising for the
case s > 1 is to replace the term Pr[C_k | X ∧ P_j] with the expression

$$\sum_{l=1}^{s} \Pr[C_k \mid X_l \wedge P_j]$$

which becomes

$$Q_k = \sum_{l=1}^{s} \Pr_{mean}[C_k \mid X_l \wedge P_j] \qquad \text{(Eq. 52)}$$

after probability averaging, where Pr_mean[C_k | X_l ∧ P_j] is as defined in Eq. 33. Q_k is
undefined if freq_mean(P_j, X_l) = 0 for any X_l, l = 1, ..., s, where

$$\mathrm{freq}_{mean}(P_j, X_l) = \sum_h \mathrm{freq}(P_{jh}, X_{lh})$$

and (P_jh, X_lh) are equivalent to (P_j, X_l). Even when freq_mean(P_j, X_l) is non-zero but a
small number (eg. < f_min), Q_k cannot be estimated from the frequency tables with sufficient
accuracy. As in the case involving a single hit node, this problem can be overcome by
replacing the hit nodes X_1, ..., X_s by a new set of nodes S in which some or all of the hit
nodes are replaced by their ancestors, after which Q_k is redefined in terms of elements
of S.

A method 21000 depicted by the flowchart of Fig. 21 is preferably used to
determine this new set of nodes S that replaces the hit nodes X_1, ..., X_s. The
method 21000 begins at step 21005 where the initial set of hit nodes X_1, ..., X_s is denoted
by S. At the next step 21010 an unprocessed element X_p of S is selected. Decision
step 21015 then follows in which a check is made to determine if freq_mean(P_j, X_p) is
greater than or equal to some threshold constant f_min. If so then X_p is retained in the set S
and the method 21000 continues to decision step 21020 where a check is made to
determine if all elements in S have been processed. If one or more unprocessed elements
remain then execution returns to step 21010 to select another element of S for processing.
If on the other hand all elements have been processed, then the method ends at step 21025
with success.

Returning now to decision step 21015: if the test condition fails then another
decision step 21030 follows, which tests if the selected node X_p is a child node of P_j. If it
is then the method 21000 ends at step 21040 with failure, otherwise step 21035 follows.
At step 21035, the element X_p in S is replaced by its parent X'_p that lies along the directed
path from P_j to X_p, and all descendants of X'_p are removed from S. Execution then
proceeds to step 21020.
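
Method 21000 can be sketched in Python as below. The sketch assumes a single-parent tree given by a hypothetical `parent` mapping and a `freq_mean` callback returning the summed frequency for (P_j, X_p); it also assumes that a replacement node X'_p is itself treated as an unprocessed element of S.

```python
from typing import Callable, Dict, List, Optional, Set


def is_descendant(parent: Dict[str, str], node: str, ancestor: str) -> bool:
    """True if `ancestor` lies strictly above `node` in the tree."""
    while node in parent:
        node = parent[node]
        if node == ancestor:
            return True
    return False


def generalise_hits(
    hits: List[str],
    parent: Dict[str, str],
    p_j: str,
    freq_mean: Callable[[str, str], float],
    f_min: float,
) -> Optional[Set[str]]:
    """Replace poorly supported hit nodes by ancestors; return None on failure."""
    s = list(hits)                       # step 21005
    pending = list(hits)
    while pending:                       # steps 21010 and 21020
        x = pending.pop()
        if x not in s:
            continue                     # already removed as a descendant of a replacement
        if freq_mean(p_j, x) >= f_min:   # step 21015: enough support, keep x
            continue
        if parent[x] == p_j:             # step 21030: cannot move any higher
            return None                  # step 21040: failure
        x_up = parent[x]                 # step 21035: replace x by its parent
        s = [n for n in s if not is_descendant(parent, n, x_up)]
        if x_up not in s:
            s.append(x_up)
            pending.append(x_up)
    return set(s)                        # step 21025: success


# Hypothetical tree P -> C1 -> D -> {X1, X2} and assumed support counts.
tree = {"C1": "P", "D": "C1", "X1": "D", "X2": "D"}
support = {"X1": 1, "X2": 10, "D": 7, "C1": 0}
new_set = generalise_hits(["X1", "X2"], tree, "P", lambda p, x: support[x], f_min=5)
```

In this example X1 has too little support and is replaced by its parent D; X2, being a descendant of D, is then removed, so `new_set` is `{"D"}`.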

If method 21000 as described above returns with success, then the elements in the
resulting set S are used to compute a value Q_k for each child node C_k:

$$Q_k = \sum_{X \in S} \Pr_{mean}[C_k \mid X \wedge P_j] \qquad \text{(Eq. 53)}$$

If however method 21000 returns with failure, then the value Q_k is preferably
determined from the distance between C_k and the root node Y_i:

$$Q_k = \begin{cases} 1 & \text{if } \mathrm{dist}(C_k, Y_i) \le d_{max} \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Eq. 54)}$$

or alternatively the distances between C_k and the hit nodes X_1, ..., X_n:

$$Q_k = \begin{cases} 1 & \text{if } \min_l \mathrm{dist}(C_k, X_l) \le d_{max} \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Eq. 55)}$$



Recall also that in the case where there is a single hit node X, it is necessary to
evaluate the probability that given a parent node P_j and a hit node X, only one child node
C_1 of P_j occurs, where C_1 is X or is an ancestor of X:

$$\Pr[C_1\ \mathit{nosibling} \mid X \wedge P_j]$$

In generalising this quantity to the present scenario, two possibilities can arise,
namely the special case where r = 1, and the more general case r > 1. An example of the
former is shown in Fig. 15 where there are three hit nodes 15030, 15035 and 15040
located within the sub-tree rooted at node 15005. However, all three hit nodes reside
under a single child node 15010 of node 15005.

An approach for handling this special case is to replace the term
Pr[C_1 nosibling | X ∧ P_j] with the expression

$$Q_0 = \sum_{l=1}^{s} \Pr_{mean}[C_1\ \mathit{nosibling} \mid X_l \wedge P_j] \qquad \text{(Eq. 56)}$$

in an analogous fashion to the use of the quantity Q_k in Eq. 52. As in the case of Q_k above,
there is a possibility that Eq. 56 is undefined when freq_mean(P_j, X_l) = 0 for any X_l,
l = 1, ..., s. Consequently, the actual value assigned to Q_0 is based on the set S obtained
from the method 21000, if the method 21000 returns with success:

$$Q_0 = \sum_{X \in S} \Pr_{mean}[C_1\ \mathit{nosibling} \mid X \wedge P_j] \qquad \text{(Eq. 57)}$$

Otherwise if the method 21000 fails, then Q_0 is assigned a value based on the
distance of P_j from the root node Y_i:

$$Q_0 = \begin{cases} 0 & \text{if } \mathrm{dist}(P_j, Y_i) + 1 \le d_{max} \\ 1 & \text{otherwise} \end{cases} \qquad \text{(Eq. 58)}$$

or alternatively the distances between P_j and the hit nodes X_1, ..., X_n:

$$Q_0 = \begin{cases} 0 & \text{if } \min_l \mathrm{dist}(P_j, X_l) + 1 \le d_{max} \\ 1 & \text{otherwise} \end{cases} \qquad \text{(Eq. 59)}$$

A method 16000 depicted in the flowchart of Fig. 16 for identifying context nodes
among the set of child nodes C_k of a parent node P_j for the case r = 1, s > 1 is very similar
to the method 14000 for the single hit node case. The method 16000 begins at step 16001
where a fictitious child node C_0 is conceptually created and added to the list of actual child
nodes C_1, ..., C_m and is assigned a value Q_0 defined in Eq. 57, Eq. 58, or Eq. 59. At the
next step 16005, the actual child nodes C_k except C_1 are assigned values Q_k
correspondingly defined in Eq. 53, Eq. 54, or Eq. 55 respectively. Step 16006 follows
step 16005 and invokes the method 7000 at step 7005 (skipping step 7001) to select
among the child nodes C_0, ..., C_m a set of context nodes. When the method 7000 exits, the
method 16000 resumes at decision step 16010 where a check is made to determine if the
fictitious child node C_0 has been selected as a context node. If so then execution continues
at step 16020 where C_0 is excluded as a context node. The method 16000 subsequently
terminates at step 16015. If the test at 16010 fails, then the method 16000 proceeds
directly to the termination step 16015.

For the general case where r > 1 (and hence s > 1), an analogous quantity to
Pr[C_1 nosibling | X_1 ∧ ... ∧ X_s ∧ P_j] used in the case r = 1 is

$$\sum_{X \in S} \Pr[C_1 \wedge \cdots \wedge C_r \wedge \neg C_{r+1} \wedge \cdots \wedge \neg C_m \mid X \wedge P_j]$$

where S is the set returned by the method 21000 if it exits with success. Unfortunately the
probability in the summation cannot be easily estimated from the existing frequency
tables. Consequently, a slightly different expression is used in its place. Let the elements
of set S that are not located in the sub-tree rooted at each child node C_k, 1 ≤ k ≤ r, be
denoted by H_kl for 1 ≤ l ≤ s_k, where s_k ≤ |S|. For each child node C_k, 1 ≤ k ≤ r, the
following is computed:

$$Q_k = \sum_{l=1}^{s_k} \Pr_{mean}[C_k \mid H_{kl} \wedge P_j] \qquad \text{(Eq. 60)}$$

The rationale behind the above expression is that when summed together
over all C_k, 1 ≤ k ≤ r, a quantity approximating the probability of child nodes C_1, ..., C_r
occurring together is obtained (although not a true probability since it can take on a value
> 1). As in the case r = 1, if method 21000 returns with failure then Q_k is obtained from
the distance of P_j from the root node Y_i, for 1 ≤ k ≤ r:

$$Q_k = \begin{cases} 0 & \text{if } \mathrm{dist}(P_j, Y_i) + 1 \le d_{max} \\ 1 & \text{otherwise} \end{cases} \qquad \text{(Eq. 61)}$$

or alternatively the distances between P_j and the hit nodes X_1, ..., X_n:

$$Q_k = \begin{cases} 0 & \text{if } \min_l \mathrm{dist}(P_j, X_l) + 1 \le d_{max} \\ 1 & \text{otherwise} \end{cases} \qquad \text{(Eq. 62)}$$




A method 17000 for selecting context nodes among the set of child nodes C_1, ...,
C_m is now described for the general case r > 1, s > 1, with reference to the flowchart of
Fig. 17. Method 17000 begins at step 17001 where each child node C_k, 1 ≤ k ≤ r, is
assigned a value Q_k computed using Eq. 60, Eq. 61, or Eq. 62. Step 17005 follows in
which the remaining child nodes are assigned values Q_k correspondingly computed using
Eq. 53, Eq. 54, or Eq. 55 respectively. At the next step 17010, the values Q_k are summed
over all child nodes and denoted by T. The method 17000 continues at step 17015 where
all child nodes containing hit nodes in their sub-trees, namely C_k, 1 ≤ k ≤ r, are selected as
context nodes. At the next step 17020, nodes C_k with the highest assigned value among
the remaining child nodes are also selected as context nodes. If more than one child node
exists with the same highest value then all such nodes are selected as context nodes. The
sum of the assigned values of all child nodes so far selected as context nodes is then
computed at step 17025 and denoted by S. Execution then proceeds to the decision
step 17030, at which point if all child nodes C_k have been selected as context nodes then
the method 17000 terminates at step 17040. If however there are one or more child nodes
C_k not yet selected as context nodes then the method 17000 continues to another decision
step 17035. At step 17035 a check is made to ascertain whether S ≥ T/2 and if so the
method 17000 again terminates at step 17040. If S < T/2 then execution returns to
step 17020 where more nodes are selected as context nodes.
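
The selection loop of method 17000 can be sketched in Python as follows. This is only an illustrative reading of steps 17010 to 17040, not the specification's own implementation: the function name `select_context_children`, the dictionary-based inputs and the sample values are assumptions introduced here, and the values Q_k are taken as already computed.

```python
from typing import Dict, Hashable, Iterable, Set


def select_context_children(q: Dict[Hashable, float],
                            hit_children: Iterable[Hashable]) -> Set[Hashable]:
    """Sketch of steps 17010-17040 (names and data layout are hypothetical).

    q            -- assigned value Q_k for every child node of P_j
    hit_children -- child nodes whose sub-trees contain hit nodes (keys of q)
    """
    total = sum(q.values())            # step 17010: T
    selected = set(hit_children)       # step 17015: hit-bearing children
    while True:
        remaining = [c for c in q if c not in selected]
        if remaining:                  # step 17020: highest value, ties included
            best = max(q[c] for c in remaining)
            selected.update(c for c in remaining if q[c] == best)
        s = sum(q[c] for c in selected)            # step 17025: S
        if len(selected) == len(q) or s >= total / 2:
            return selected            # steps 17030 / 17035 -> 17040


# Hypothetical values: child "a" contains the hit nodes.
print(sorted(select_context_children(
    {"a": 0, "b": 9, "c": 9, "d": 4, "e": 1}, {"a"})))
```

With these made-up values T = 23, so after "a" is selected the tied nodes "b" and "c" are added together and the loop stops because S = 18 is at least T/2.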

The preceding descriptions present various methods for handling different stages
and operating scenarios encountered when performing keyword searching in hierarchical
data structures. These methods are incorporated into a single overall procedure 18000
which elaborates on step 2010 of Fig. 2, illustrated by the flowchart in Fig. 18 which
comprises sub procedures 19000 and 20000 shown in Fig. 19 and Fig. 20, respectively.
The method 18000 begins at decision step 18005 where a check is made to determine
whether there are multiple hit nodes in the schema graph. If so then execution proceeds to
step 18015 where the method 20000 is invoked, otherwise it proceeds to step 18010 where
the method 19000 is invoked. In either case, the method 20000 or 19000 returns with a
list of context trees, each having an associated score. The following is a detailed
description of the method 19000, followed by that of the method 20000.


The method 19000 begins at step 19001 where the method 10000 is invoked to
determine a list of possible root nodes Y_i that are ancestor nodes of the hit node X. Each
Y_i is the root node of a possible context tree. The method 10000 also computes a value
S_i = Pr'[Y_i | X] for each node Y_i. The method 19000 then continues at step 19005 where
a node Y_i determined in the previous step is selected for processing. At the next
step 19010, method 38000, which is a sub-process within method 19000, is invoked to
identify context nodes in the subtree rooted at node Y_i. Method 19000 then continues at
step 19030, where a context tree is constructed comprising all identified context nodes and
with Y_i as the root node. The tree is assigned a score of S_i computed at step 19001. The
method 19000 then proceeds to decision step 19035. If all nodes Y_i obtained at
step 19001 have been processed, then the method ends at step 19040, otherwise it returns
to step 19005 to process another node Y_i.

The method 38000 invoked within method 19000 begins at step 38010 where
node Y_i is first assigned to P_j. Execution proceeds to the decision step 38015 and then to
step 38020 if P_j does not lie on the directed path from Y_i to the hit node X. At step 38020,
the method 13000 is invoked to select among the child nodes of P_j a set of context nodes.
At the subsequent step 38025, the method 38000 is recursively invoked at step 38020
(skipping steps 38010 and 38015) for each non-leaf child node C_k selected as context
node, with C_k playing the role of P_j in order to identify additional context nodes among its
descendants. When the invocations for all such child nodes return, method 38000
terminates at step 38040. Method 38000 also proceeds directly to the termination step
38040 if P_j has no child nodes, or if none of its non-leaf child nodes have been selected as
context nodes at step 38020.

The decision step 38015 succeeds if P_j lies on the directed path from Y_i to X, in
which case execution proceeds to step 38045. Here the method 14000 is invoked to
select among the child nodes of P_j a set of context nodes, with C_1 denoting the child node
lying on the directed path from P_j to X. At the subsequent step 38050, method 38000 is
recursively invoked at step 38015 (skipping step 38010) for each non-leaf child node C_k
selected as context node, with C_k playing the role of P_j in order to identify additional
context nodes among its descendants. When the invocations for all such child nodes
return, method 38000 terminates at step 38040.
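
The top-down recursion of method 38000 can be illustrated with the following Python sketch. It is a simplified reading under stated assumptions: the tree is a plain dictionary of child lists, `path_to_hit` holds the proper ancestors of a leaf hit node from the root downwards, and the two selector callbacks merely stand in for methods 13000 and 14000, whose internals are not reproduced here. All names are hypothetical.

```python
from typing import Callable, Dict, List, Set

Tree = Dict[str, List[str]]                      # node -> child nodes
Selector = Callable[[str, List[str]], List[str]]


def collect_context(node: str, tree: Tree, path_to_hit: Set[str],
                    select_off_path: Selector,
                    select_on_path: Selector) -> Set[str]:
    """Recursively gather context nodes below `node` (sketch of method 38000)."""
    children = tree.get(node, [])
    if not children:                             # no child nodes: terminate
        return set()
    if node in path_to_hit:                      # decision step 38015 succeeds
        chosen = select_on_path(node, children)  # stand-in for method 14000
    else:
        chosen = select_off_path(node, children) # stand-in for method 13000
    context = set(chosen)
    for child in chosen:
        if tree.get(child):                      # recurse into non-leaf nodes only
            context |= collect_context(child, tree, path_to_hit,
                                       select_off_path, select_on_path)
    return context


# Toy schema and deliberately trivial selectors, for illustration only.
tree = {"branch": ["name", "address", "product"],
        "address": ["street", "city"],
        "product": ["id", "price"]}
keep_all = lambda parent, kids: list(kids)
keep_first = lambda parent, kids: kids[:1]
print(sorted(collect_context("branch", tree, {"branch", "address"},
                             keep_first, keep_all)))
```

The two callbacks are kept separate so that the on-path and off-path cases can apply different selection rules, mirroring the split at decision step 38015.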

The method 20000 begins at step 20001 where the method 12000 is invoked to
determine a list of possible root nodes Y_i that are ancestor nodes of the hit nodes X_1, ...,
X_n. Each Y_i is the root node of a possible context tree. The method 12000 also computes
a value S_i = Pr'[Y_i | X_1 ∧ ... ∧ X_n] for each node Y_i. The method 20000 then continues at
step 20005 where a node Y_i determined in the previous step is selected for processing. At
the next step 20010, method 39000, which is a sub-process within method 20000, is
invoked to identify context nodes in the subtree rooted at node Y_i. Method 20000 then
continues at step 20060, where a context tree is constructed comprising all identified
context nodes and with Y_i as the root node. The tree is assigned a score of S_i computed at
step 20001. The method 20000 then proceeds to decision step 20065. If all nodes Y_i
obtained at step 20001 have been processed, then the method ends at step 20070,
otherwise it returns to step 20005 to process another node Y_i.

The method 39000 invoked within method 20000 begins at step 39010 where node
Y_i is first assigned to P_j. Execution proceeds to the decision step 39015 and then to
step 39020 if there are no hit nodes in the sub-tree rooted at P_j. At step 39020, the
method 13000 is invoked to select among the child nodes of P_j a set of context nodes. At
the subsequent step 39025, the method 39000 is recursively invoked at step 39020
(skipping steps 39010 and 39015) for each non-leaf child node C_k selected as context
node, with C_k playing the role of P_j in order to identify additional context nodes among its
descendants. When the invocations for all such child nodes return, method 39000
terminates at step 39060. Method 39000 also proceeds directly to the termination
step 39060 if P_j has no child nodes, or if none of its non-leaf child nodes have been
selected as context nodes at step 39020.

The decision step 39015 succeeds if there is one or more hit nodes within the
sub-tree rooted at P_j, in which case execution proceeds to another decision step 39030. If there
is only a single hit node in the sub-tree under P_j then this decision step fails and execution
proceeds to step 39035, otherwise it continues to yet another decision step 39040. At
decision step 39040, a test is made to determine whether all hit nodes under P_j are located
under only one of its child nodes. If so, then execution proceeds to step 39045, otherwise
it proceeds to step 39050. At step 39050, with C_1, ..., C_r denoting the child nodes of P_j
under which one or more hit nodes reside, the method 17000 is invoked to select among
the child nodes of P_j a set of context nodes. If however decision step 39040 leads to
step 39045, then the method 16000 is invoked to select among the child nodes of P_j a set
of context nodes, with C_1 being the sole child node of P_j that contains hit nodes in its
sub-tree.

Returning now to step 39035, let the path from P_j to its one and only descendant hit
node pass through its child node C_1. The method 14000 is invoked to select among the
child nodes of P_j a set of context nodes.

At the completion of each of steps 39035, 39045 and 39050, the method 39000
recursively invokes itself at step 39015 (skipping step 39010) for each non-leaf child node
C_k selected as context node, with C_k playing the role of P_j in order to identify additional
context nodes among its descendants. When the invocations for all such child nodes
return, method 39000 terminates at step 39060.
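
The branching performed by decision steps 39015, 39030 and 39040 amounts to choosing one of four selection methods according to how the hit nodes are distributed beneath P_j. A minimal Python sketch of that choice is given below; the function name and the `hits_under_child` mapping are assumptions made for illustration, and the sketch returns only the reference numeral of the method that would be invoked.

```python
from typing import Dict, Hashable


def pick_selection_method(hits_under_child: Dict[Hashable, int]) -> int:
    """Return the numeral of the method invoked for P_j (illustrative only).

    hits_under_child maps each child node of P_j to the number of hit nodes
    located in that child's sub-tree (the child itself included).
    """
    total_hits = sum(hits_under_child.values())
    children_with_hits = [c for c, n in hits_under_child.items() if n > 0]
    if total_hits == 0:
        return 13000      # step 39020: no hit nodes under P_j
    if total_hits == 1:
        return 14000      # step 39035: a single descendant hit node
    if len(children_with_hits) == 1:
        return 16000      # step 39045: several hits, all under one child
    return 17000          # step 39050: hits spread over several children


# Hit counts keyed by child node number, chosen for illustration.
print(pick_selection_method({6: 0, 7: 0, 8: 1, 9: 0, 10: 1}))   # 17000
print(pick_selection_method({11: 0, 12: 0, 13: 1, 14: 0}))      # 14000
```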

Illustrative Example 
The operation of a preferred implementation is now demonstrated with an example
hierarchical XML data source below. The XML source comprises data relating to a
company named "XYZ" such as its web addresses, branch names and locations, and its
range of sales products at each branch. A schema graph representation of the XML data is
shown in Fig. 23.

XML SOURCE

<company>
    <name>XYZ</name>
    <web>http://www.xyz.com</web>
    <description>
        Company founded in 1999 specialising in hi-tech consumers electronics
    </description>
    <branch>
        <name>North Ryde</name>
        <phone>0291230000</phone>
        <address>
            <number>1</number>
            <street>Lane Cove</street>
            <city>Sydney</city>
            <country>Australia</country>
        </address>
        <manager>
            <firstName>Jim</firstName>
            <lastName>Smith</lastName>
            <email>jsmith@xyz.com</email>
        </manager>
        <product>
            <id>1</id>
            <name>Plasma TV</name>
            <price>$10000</price>
            <supplier>JEC</supplier>
            <stock>10</stock>
        </product>
        <product>
            <id>2</id>
            <name>Mp3 player</name>
            <price>$500</price>
            <supplier>HG</supplier>
            <stock>20</stock>
        </product>
    </branch>
    <branch>
        <name>Morley</name>
        <phone>0891230000</phone>
        <address>
            <number>1</number>
            <street>Russel</street>
            <city>Perth</city>
            <country>Australia</country>
        </address>
        <manager>
            <firstName>Ted</firstName>
            <lastName>White</lastName>
            <email>twhite@xyz.com</email>
        </manager>
        <product>
            <id>3</id>
            <name>Video phone</name>
            <price>$2000</price>
            <supplier>NVC</supplier>
            <stock>15</stock>
        </product>
        <product>
            <id>4</id>
            <name>PDA</name>
            <price>$1000</price>
            <supplier>LP</supplier>
            <stock>50</stock>
        </product>
    </branch>
</company>

In Fig. 23, the integer shown next to each node is a unique ID number assigned to
the node. Suppose that there are three existing views of this data source. The first is a
view displaying the company's name, description and web address. The second is a listing
of the company's branches and their locations, and finally the third view lists the line of
products at each branch. Schema graph representations of these views are shown in
Fig. 24, Fig. 25 and Fig. 26 respectively. As a result of these views, the occurrence 27000,
co-occurrence 28000, leaf co-occurrence 29000, and sole child co-occurrence 30000
frequency tables are as shown in Fig. 27, Fig. 28, Fig. 29 and Fig. 30 respectively. The
joint-occurrence frequency table, being three-dimensional, is depicted by five separate
two-dimensional tables 31000, 32000, 33000, 34000 and 35000. Fig. 31 comprises entries
freq(C_k, P_j, X) in the table with P_j = node 1. Similarly Fig. 32, Fig. 33, Fig. 34 and Fig. 35
each comprises entries with P_j = node 3, node 8, node 9, and node 10 respectively. In all
frequency tables shown, an empty cell such as Item 28005, as seen in Fig. 28, denotes an
invalid node combination whose associated frequency is not required to be stored.

Suppose that a user wishes to locate a particular product in the city where the user
resides. The user enters the product's name, "Mp3 player", and the name of the city,
"Sydney", and performs a keyword search for both names. As seen from Fig. 23 this
results in two hit nodes X_1 = node 19 and X_2 = node 13. To determine possible context
trees for the keyword search operation, the system 4000 invokes method 18000 of Fig. 18.
Since there is more than one hit node, the method 18000 subsequently invokes the
method 20000 at step 18015. The method 20000 in turn invokes the method 12000 at
step 20001 to obtain a list of nodes Y_i to serve as root nodes of the resulting context trees.

The method 12000 first identifies at step 12001 node 3 as the root node of the
smallest sub-tree containing both hit nodes X_1 and X_2. Thus A = node 3. The
method 12000 then begins a recursive procedure to compute an occurrence probability
value for each of A and its ancestors, given the hit nodes. At node A

$$\mathrm{Pr}'[\,A \mid X_1 \wedge X_2\,] = 1$$

At Y_i = node 1, the parent of node A, using Eq. 47, Eq. 48 and Eq. 49

$$\mathrm{Pr}_{\text{mean}}[\,\text{node 1} \mid A \wedge X_1 \wedge X_2\,] = \frac{\mathrm{freq}(\text{node 1}, \text{node 19})\;\mathrm{freq}(\text{node 1}, \text{node 13})\;\mathrm{freq}(\text{node 3})}{\mathrm{freq}(\text{node 3}, \text{node 19})\;\mathrm{freq}(A, \text{node 13})\;\mathrm{freq}(\text{node 1})} = 0$$

Thus

$$\mathrm{Pr}'[\,\text{node 1} \mid X_1 \wedge X_2\,] = 0$$

and

$$\mathrm{Pr}'[\,A \text{ root} \mid X_1 \wedge X_2\,] = 1$$

Consequently the method 12000 exits with node A as a single candidate root node
for a context tree. This context tree is assigned a score of 1. After the completion of the
method 12000, the method 20000 continues with the second, top-down traversal phase
where descendants of the root node Y_i = A are processed to identify context nodes among
them. This phase begins at step 20010 where P_j is first set to be node 3. Since this node is
an ancestor of the hit nodes X_1 and X_2, which are located under two distinct child nodes,
execution proceeds eventually to step 39050 of method 39000, where the method 17000 is
invoked to determine context nodes among its children. The values Q_1, ..., Q_5 assigned to
the child nodes 6-10 respectively of node 3 due to method 17000 are as follows:

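$$Q_1 = \mathrm{Pr}_{\text{mean}}[\,\text{node 6} \mid X_1 \wedge P_j\,] + \mathrm{Pr}_{\text{mean}}[\,\text{node 6} \mid X_2 \wedge P_j\,] = \frac{\mathrm{freq}(\text{node 6}, \text{node 3}, \text{node 19})}{\mathrm{freq}(\text{node 3}, \text{node 19})} + \frac{\mathrm{freq}(\text{node 6}, \text{node 3}, \text{node 13})}{\mathrm{freq}(\text{node 3}, \text{node 13})} = \frac{1}{1} + \frac{1}{1} = 2$$

$$Q_2 = \mathrm{Pr}_{\text{mean}}[\,\text{node 7} \mid X_1 \wedge P_j\,] + \mathrm{Pr}_{\text{mean}}[\,\text{node 7} \mid X_2 \wedge P_j\,] = \frac{\mathrm{freq}(\text{node 7}, \text{node 3}, \text{node 19})}{\mathrm{freq}(\text{node 3}, \text{node 19})} + \frac{\mathrm{freq}(\text{node 7}, \text{node 3}, \text{node 13})}{\mathrm{freq}(\text{node 3}, \text{node 13})} = 2$$

$$Q_3 = \mathrm{Pr}_{\text{mean}}[\,\text{node 8} \mid X_1 \wedge P_j\,] = \frac{\mathrm{freq}(\text{node 8}, \text{node 3}, \text{node 19})}{\mathrm{freq}(\text{node 3}, \text{node 19})} = 0$$

$$Q_4 = \mathrm{Pr}_{\text{mean}}[\,\text{node 9} \mid X_1 \wedge P_j\,] + \mathrm{Pr}_{\text{mean}}[\,\text{node 9} \mid X_2 \wedge P_j\,] = \frac{\mathrm{freq}(\text{node 9}, \text{node 3}, \text{node 19})}{\mathrm{freq}(\text{node 3}, \text{node 19})} + \frac{\mathrm{freq}(\text{node 9}, \text{node 3}, \text{node 13})}{\mathrm{freq}(\text{node 3}, \text{node 13})} = 1$$

$$Q_5 = \mathrm{Pr}_{\text{mean}}[\,\text{node 10} \mid X_2 \wedge P_j\,] = \frac{\mathrm{freq}(\text{node 10}, \text{node 3}, \text{node 13})}{\mathrm{freq}(\text{node 3}, \text{node 13})} = 0$$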

Thus the set Q_1, ..., Q_5 sorted in descending order is {Q_1, Q_2, Q_4, Q_3, Q_5} and
sums to T = 5. The set of context nodes selected by the method 17000 thus comprises
node 6, node 7 (since Q_1 + Q_2 > T/2), and node 8, node 10 (since they are ancestors of hit
nodes). Resuming at step 39055, the method 39000 then recursively invokes itself to
identify context child nodes for each of the selected nodes that have children.
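
The arithmetic behind Q_1 to Q_5 can be checked with a short Python sketch of Eq. 60, using the single-ratio form freq(C_k, P_j, X) / freq(P_j, X) that appears in the worked values above. The frequency counts below are assumptions: the tables of Fig. 28 and Fig. 32 are not reproduced in this text, so the entries were chosen only to agree with the stated results, and in particular the attribution of the non-zero term of Q_4 to hit node 13 is a guess. The function name `q_value` is likewise hypothetical.

```python
from typing import Dict, Iterable, Tuple

PairFreq = Dict[Tuple[int, int], int]         # (parent, hit) -> count
JointFreq = Dict[Tuple[int, int, int], int]   # (child, parent, hit) -> count


def q_value(child: int, parent: int, outside_hits: Iterable[int],
            joint: JointFreq, pair: PairFreq) -> float:
    """Sum freq(child, parent, hit) / freq(parent, hit) over the hit nodes
    that lie outside the child's own sub-tree (one reading of Eq. 60)."""
    total = 0.0
    for hit in outside_hits:
        denom = pair.get((parent, hit), 0)
        if denom:                              # skip unseen (parent, hit) pairs
            total += joint.get((child, parent, hit), 0) / denom
    return total


# Assumed counts, consistent with the results quoted in the text.
pair = {(3, 19): 1, (3, 13): 1}
joint = {(6, 3, 19): 1, (6, 3, 13): 1,
         (7, 3, 19): 1, (7, 3, 13): 1,
         (9, 3, 13): 1}                        # assumed split for node 9
# Hit nodes lying outside each child's sub-tree (node 8 holds 13, node 10 holds 19).
hits_outside = {6: [19, 13], 7: [19, 13], 8: [19], 9: [19, 13], 10: [13]}

q = {c: q_value(c, 3, hs, joint, pair) for c, hs in hits_outside.items()}
total = sum(q.values())
print(q, total)
```

Under these assumed counts the sketch yields the values 2, 2, 0, 1 and 0 for nodes 6 to 10 and a total T = 5, matching the figures given above.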

For P_j = node 8, execution proceeds to step 39035 since node 8 has a single
descendant hit node (node 13), at which point method 14000 is invoked to identify context
nodes among the set of child nodes 11-14. The probability values Q_1, Q_2, and Q_4
assigned to the child nodes 11, 12, and 14 respectively of P_j due to the method 14000 are
as follows:

$$Q_1 = \mathrm{Pr}_{\text{mean}}[\,\text{node 11} \mid X_2 \wedge P_j\,] = \frac{\mathrm{freq}(\text{node 11}, \text{node 8}, \text{node 13})}{\mathrm{freq}(\text{node 8}, \text{node 13})} = 1$$

$$Q_2 = \mathrm{Pr}_{\text{mean}}[\,\text{node 12} \mid X_2 \wedge P_j\,] = \frac{\mathrm{freq}(\text{node 12}, \text{node 8}, \text{node 13})}{\mathrm{freq}(\text{node 8}, \text{node 13})} = 1$$

$$Q_4 = \mathrm{Pr}_{\text{mean}}[\,\text{node 14} \mid X_2 \wedge P_j\,] = \frac{\mathrm{freq}(\text{node 14}, \text{node 8}, \text{node 13})}{\mathrm{freq}(\text{node 8}, \text{node 13})} = 1$$

In addition, the method 14000 also computes a value Q_0 for a fictitious child node C_0:

$$Q_0 = \frac{\mathrm{freq}(\text{node 8 has 1 child}, \text{node 13})}{\mathrm{freq}(\text{node 8}, \text{node 13})} = 0$$

Thus the set of probability values sorted in descending order is {Q_1, Q_2, Q_4, Q_0}
and sums to T = 3. The set of context nodes selected by the method 14000 thus comprises
node 11, node 12, node 14 (since Q_1 + Q_2 + Q_4 > T/2 and Q_1 = Q_2 = Q_4), and node 13
(since it is the hit node itself).
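
The role of the fictitious child node can be illustrated with the Python sketch below. It follows the outline of the selection as recited in claim 12 (rank the off-path children together with the fictitious node, select in descending order until half of the total is reached, then discard the fictitious node), and it assumes that children with equal values are selected together, as the remark Q_1 = Q_2 = Q_4 suggests. The function name, argument names and the second set of values are hypothetical.

```python
from typing import Dict, Hashable, Set


def select_with_fictitious(q: Dict[Hashable, float], q_none: float,
                           path_child: Hashable) -> Set[Hashable]:
    """Sketch of a method-14000-style selection (details assumed).

    q          -- values of the child nodes not on the path to the hit node
    q_none     -- value Q_0 of the fictitious node ("no other children")
    path_child -- child node on the path to the hit node, always selected
    """
    fictitious = object()                        # placeholder for node C_0
    scored = dict(q)
    scored[fictitious] = q_none
    total = sum(scored.values())                 # T
    selected: Set[Hashable] = set()
    running = 0.0
    for value in sorted(set(scored.values()), reverse=True):
        group = [n for n, v in scored.items() if v == value]
        selected.update(group)                   # equal values selected together
        running += value * len(group)
        if running >= total / 2:
            break
    selected.discard(fictitious)                 # C_0 is never a real context node
    selected.add(path_child)
    return selected


# Values from the example for P_j = node 8: Q_1 = Q_2 = Q_4 = 1 and Q_0 = 0.
print(sorted(select_with_fictitious({11: 1, 12: 1, 14: 1}, 0, 13)))
# Hypothetical values where the fictitious node dominates.
print(select_with_fictitious({"a": 0.2, "b": 0.1}, 0.7, "p"))
```

In the second call the fictitious node alone reaches half of the total, so no off-path child is selected and only the path child remains, which is the blocking effect the fictitious node is intended to have.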

A similar execution path is followed for the case P_j = node 10, with similar results
being obtained. The set of context child nodes of node 10 are nodes 18 - 22. The schema
graph 3600 of the context tree is thus as shown in Fig. 36, comprising the hit nodes 19
and 13, and context nodes 3, 6 - 8, 10 - 14, 18 - 22. The actual context tree returned to the
user comprising data items represented by these nodes is as follows:



<branch>
    <name>North Ryde</name>
    <phone>0291230000</phone>
    <address>
        <number>1</number>
        <street>Lane Cove</street>
        <city>Sydney</city>
        <country>Australia</country>
    </address>
    <product>
        <id>1</id>
        <name>Plasma TV</name>
        <price>$10000</price>
        <supplier>JEC</supplier>
        <stock>10</stock>
    </product>
    <product>
        <id>2</id>
        <name>Mp3 player</name>
        <price>$500</price>
        <supplier>HG</supplier>
        <stock>20</stock>
    </product>
</branch>

<branch>
    <name>Morley</name>
    <phone>0891230000</phone>
    <address>
        <number>1</number>
        <street>Russel</street>
        <city>Perth</city>
        <country>Australia</country>
    </address>
    <product>
        <id>3</id>
        <name>Video phone</name>
        <price>$2000</price>
        <supplier>NVC</supplier>
        <stock>15</stock>
    </product>
    <product>
        <id>4</id>
        <name>PDA</name>
        <price>$1000</price>
        <supplier>LP</supplier>
        <stock>50</stock>
    </product>
</branch>

Industrial Applicability 

It is apparent from the above that the arrangements described are applicable to the 
computer and data processing industries, and particularly in respect of presenting 
information from multiple searches. 

The foregoing describes only some embodiments of the present invention, and 
modifications and/or changes can be made thereto without departing from the scope and 
spirit of the invention, the embodiments being illustrative and not restrictive. 
(Australia Only) In the context of this specification, the word "comprising" means
"including principally but not necessarily solely" or "having" or "including", and not
"consisting only of". Variations of the word "comprising", such as "comprise" and
"comprises" have correspondingly varied meanings.
The claims defining the invention are as follows:

1. A method of constructing at least one hierarchical data structure from at least one 
hierarchical data source, said method comprising the steps of: 

(i) constructing a representation of said least one hierarchical data source and
at least one previous view of said least one hierarchical data source;

(ii) identifying at least one compulsory entity in said representation; and 

(iii) constructing said at least one hierarchical data structure comprising said
least one compulsory entity and one or more context entities, where said context entities
are obtained from said representation and context data obtained from said least one
previous view.

2. A method according to claim 1 wherein said representation comprises a graphical 
representation. 

3. A method according to claim 2 wherein said graphical representation comprises a 
schema representation of said least one hierarchical data source and said least one previous 
view. 

4. A method according to claim 1 wherein said context data comprises data
evaluated to represent a measure of relevance of said context entities to said compulsory 
entity. 

5. A method according to claim 4 wherein said context data comprises at least one
numerical data.

6. A method according to claim 5 wherein said least one associated numerical data 
comprises occurrence and joint-occurrence frequencies of entities in said representation
observed in said least one previous view.

7. A method according to claim 2 wherein the root node of said least one 
hierarchical data structure is an ancestor node of said least one compulsory entity in said 
representation. 

8. A method according to claim 1 wherein said least one hierarchical data structure 
is assigned a score equal to the occurrence probability of said root node given the 
occurrence of each of said least one compulsory entity. 

9. A method according to claim 7 wherein said one or more context entities is
selected from the group consisting of: 

(a) said root node; 

(b) a first set of nodes along at least one directed path in said representation 
from said root node to said least one compulsory entity, 

(c) a second set of nodes selected from descendant nodes of said root node in
said representation, each said node in said second set being selected based upon a
corresponding occurrence probability, each said occurrence probability being derived from 
the occurrence of ancestors of said node up to and including said root node and said least 
one compulsory entity; 
(d) a third set of nodes selected from descendant nodes of said root node in
said representation based on a corresponding distance of said third set node from said root 
node in said representation; and 

(e) a fourth set of nodes selected from descendant nodes of said root node in
said representation based on a corresponding distance of said fourth set node from said
least one compulsory entity in said representation.

10. A method according to claim 9 wherein said second set of nodes comprises zero 
or more child nodes of at least one parent node in said representation lying along a
directed path from said root node to said least one compulsory entity.

11. A method according to claim 9 wherein said corresponding distances comprise a 
number of links separating the subject nodes in said representation. 

12. A method according to claim 10 wherein, for step (iii), said zero or more child
nodes are selected as context nodes from all child nodes of said least one parent node, said 
selection comprising the steps of: 

(iii-a) computing a first occurrence probability of said parent node appearing with 
none of its child nodes other than a fifth set of nodes, given the occurrence of said parent
node, ancestors of said parent node up to and including said root node and said least one
compulsory entity, said fifth set comprising child nodes of said parent node lying along a 
directed path from said parent node to said least one compulsory entity; 

(iii-b) computing a second occurrence probability of each child node in a sixth set 
of nodes, given the occurrence of said parent node, ancestors of said parent node up to and 
including said root node and said least one compulsory entity, said sixth set comprising
child nodes of said parent node that do not lie along a directed path from said parent node 
to said least one compulsory entity; 

(iii-c) computing a total sum of said first occurrence probability and said second
occurrence probabilities;

(iii-d) creating a fictitious node and assigning said fictitious node said first
occurrence probability;

(iii-e) selecting said fifth set of child nodes as context nodes; 
(iii-f) selecting as context nodes a seventh set of child nodes formed from said
sixth set of child nodes and said fictitious node arranged in order of descending values of
said first occurrence probability or said second occurrence probability, and for which the 
sum of said first occurrence probability or said second occurrence probabilities of said 
seventh set of child nodes equals or exceeds half of said total sum; and 

(iii-g) deselecting as a context node said fictitious node if said fictitious node is
selected in said seventh set of child nodes.

12A. A method according to claim 12 wherein said fictitious node prevents other nodes,
whose associated probabilities are less than the probability associated with the fictitious
node, from being selected, since nodes are selected as context nodes until their sum
exceeds half of the total sum.

13. A method according to claim 10 wherein, for step (iii), said zero or more child 
nodes are selected as context nodes from all child nodes of said least one parent node, said 
selection comprising the steps of:
(iii-a) computing a first occurrence probability of said parent node
appearing with none of its child nodes other than a fifth set of nodes, given the occurrence
of said parent node, ancestors of said parent node up to and including said root node and
said least one compulsory entity, said fifth set comprising child nodes of said parent node
lying along a directed path from said parent node to said least one compulsory entity;

(iii-b) selecting said fifth set of child nodes as context nodes; and 
if said first occurrence probability is less than or equal to 0.5: 

(iii-c) computing a second occurrence probability of each child node in a
sixth set of nodes, given the occurrence of said parent node, ancestors of said parent node
up to and including said root node and said least one compulsory entity, said sixth set
comprising child nodes of said parent node that do not lie along a directed path from said
parent node to said least one compulsory entity;

(iii-d) computing a total sum of said second occurrence probabilities of
said second set of child nodes;

(iii-e) selecting as context nodes a seventh set of child nodes formed from
said sixth set of child nodes in order of descending values of said second occurrence
probability until the sum of said second occurrence probabilities of said seventh set of
child nodes equals or exceeds half of said total sum.

14. A method according to claim 9 wherein said second set of nodes comprises zero
or more child nodes of at least one parent node in said representation not lying along a 
directed path from said root node to said least one compulsory entity. 

15. A method according to claim 14 wherein, for step (iii), said zero or more child 
nodes are selected from all child nodes of said least one parent node, said selection
comprising the steps of:

(iii-a) computing a first occurrence probability of said parent node 
appearing without any of its child nodes given the occurrence of said parent node, 
ancestors of said parent node up to and including said root node and said least one
compulsory entity; 

(iii-b) computing a second occurrence probability of each child node of 
said parent node given the occurrence of said parent node, ancestors of said parent node up 
to and including said root node and said least one compulsory entity;

(iii-c) computing a total sum of said first occurrence probability and said
second occurrence probabilities of all child nodes of said parent node;

(iii-d) creating a fictitious node and assigning said fictitious node said first 
occurrence probability; 

(iii-e) selecting as context nodes those nodes from a set of said fictitious 
node and all child nodes of said parent node arranged in order of descending values of said
first occurrence probability or said second occurrence probabilities until the sum of said 
first occurrence probability or said second occurrence probability of selected nodes equals 
or exceeds half of said total sum; and 

(iii-f) deselecting said fictitious node as a context node if said fictitious 
node is among said selected nodes.

16. A method according to claim 14 wherein, for step (iii), said zero or more child 
nodes are selected from all child nodes of said least one parent node, said selection 
comprising the steps of:
(iii-a) computing a first occurrence probability of said parent node appearing
without any of its child nodes given the occurrence of said parent node, ancestors of said 
parent node up to and including said root node and said least one compulsory entity; and 
if said first occurrence probability is less than or equal to 0.5: 

(iii-b) computing a second occurrence probability of each child node of said
parent node given the occurrence of said parent node, ancestors of said parent node up to
and including said root node and said least one compulsory entity; 

(iii-c) computing a total sum of said second occurrence probabilities of all child 
nodes of said parent node; and

(iii-d) selecting as context nodes those nodes from the set of all child nodes of
said parent node in order of descending values of said second occurrence probability until
the sum of said second occurrence probability of selected nodes equals or exceeds half of 
said total sum. 

17. A method according to claim 12, 13, 15, or 16 wherein said first occurrence
probability and said second occurrence probability are approximated using at least one 
occurrence frequency of a node in said representation, co-occurrence frequency between a 
pair of nodes in said representation, and joint-occurrence frequency between an n-tuple of 
nodes in said representation observed in said least one previous view. 

18. A method according to claim 1 wherein said compulsory entity represents a
location of one or more search keywords. 

19. A method according to claim 1 wherein said compulsory entity represents a
user-selected entity.

20. A method according to claim 16 wherein said first occurrence probability and 
said second occurrence probability are approximated using at least one occurrence 
frequency of a node in said representation, co-occurrence frequency between a pair of
nodes in said representation, and joint-occurrence frequency between an n-tuple of nodes 
in said representation observed in said least one previous view. 

21. A method according to claim 2 wherein said graphical representation comprises a
tree representation and step (i) or (ii) includes detecting a user's selection of a sub-tree of
said representation, wherein, for step (iii), zero or more child nodes of at least one parent
node in said user-selected sub-tree are selected in a set of context nodes, said selection
comprising the steps of:

(iii-a) computing a first occurrence probability of said parent node
appearing without any of its child nodes given the occurrence of said parent node,
ancestors of said parent node up to and including the root node of said user-selected
sub-tree;

(iii-b) computing a second occurrence probability of each child node of 
said parent node given the occurrence of said parent node, ancestors of said parent node up 
to and including the root node of said user-selected sub-tree;

(iii-c) computing a total sum of said first occurrence probability and said 
second occurrence probabilities of all child nodes of said parent node; 

(iii-d) creating a fictitious node and assigning said fictitious node said first 
occurrence probability;

(iii-e) selecting as context nodes those nodes from the set of said fictitious
node and all child nodes of said parent node in order of descending values of said first 
occurrence probability or said second occurrence probability until the sum of said first 
occurrence probability or said second occurrence probability of selected nodes equals or 
exceeds half of said total sum; and

(iii-f) deselecting said fictitious node if said fictitious node is among said
selected nodes.

22. A method according to claim 2 wherein said graphical representation comprises a 
tree representation and step (i) or (ii) includes detecting a user's selection of a sub-tree of
said representation, wherein, for step (iii), zero or more child nodes of at least one parent
node in said user-selected sub-tree are selected in a set of context nodes, said selection
comprising the steps of:

(iii-a) computing a first occurrence probability of said parent node
appearing without any of its child nodes given the occurrence of said parent node, and
ancestors of said parent node up to and including the root node of said user-selected
sub-tree;

if said first occurrence probability is less than or equal to 0.5:
(iii-b) computing a second occurrence probability of each child node of
said parent node given the occurrence of said parent node, and ancestors of said parent
node up to and including the root node of said user-selected sub-tree;

(iii-c) computing a total sum of said second occurrence probability of all 
child nodes of said parent node; and 

(iii-d) selecting as context nodes those nodes from the set of all child
nodes of said parent node in order of descending values of said second occurrence
probability until the sum of said second occurrence probability of selected nodes equals or
exceeds half of said total sum.
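
As a non-limiting sketch, the thresholded variant of this claim might look as follows in Python. It assumes the probabilities are supplied by the caller, and it reads the claim as selecting no child nodes when the first occurrence probability exceeds 0.5; that reading, and the function name, are assumptions made for the example.

```python
from typing import Dict, List


def select_children_thresholded(p_alone: float, p_child: Dict[str, float]) -> List[str]:
    """Select child nodes only when the parent rarely appears alone."""
    # Assumed reading: above the 0.5 threshold, zero child nodes are selected.
    if p_alone > 0.5:
        return []

    # (iii-c) total of the second occurrence probabilities only
    total = sum(p_child.values())

    # (iii-d) descending order until the running sum reaches half of the total
    selected: List[str] = []
    running = 0.0
    for name, prob in sorted(p_child.items(), key=lambda kv: kv[1], reverse=True):
        if running >= total / 2:
            break
        selected.append(name)
        running += prob
    return selected


print(select_children_thresholded(0.1, {"title": 0.6, "author": 0.5, "isbn": 0.2}))
```

No fictitious node is needed here because the threshold test takes over its role of suppressing child nodes.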

23. A method of selecting data from a hierarchically-structured data source, said
method comprising the steps of:

(i) forming a graphical representation of said hierarchically-structured data
source;

(ii) detecting a user selection of part of said representation;

(iii) selecting a set of descendant hierarchical components in said user-selected
part based on an occurrence probability of said set of hierarchical components given a root
component of said user-selected part.

24. A method according to claim 23 wherein said graphical representation comprises
a tree representation wherein said hierarchical components comprise nodes of said tree and
said user-selected part comprises a sub-tree of said tree representation.

25. A method of construction and presentation of data for a keyword searching
operation in at least one hierarchical data source involving at least one search keyword,
said method comprising the steps of:

(i) constructing a graphical representation of said least one hierarchical data
source and at least one previous view of said least one hierarchical data source;

(ii) identifying at least one compulsory entity in said graphical representation,
where said compulsory entity is a node in said graphical representation representing a
location of one or more said least one search keyword;

(iii) constructing at least one hierarchical data structure comprising said least
one compulsory entity and one or more context entities, where said context entities are
obtained from said graphical representation and context data obtained from said least one
previous view; and

(iv) presenting said least one hierarchical data structure as result of said
keyword searching operation.

26. A method of presentation of data sourced from a sub-tree of a hierarchically-presented
data, said method comprising the steps of:

(i) selecting a set of descendant nodes in said sub-tree based on context data 
obtained from at least one previous presentation of said hierarchically-presented data; and 

(ii) constructing and presenting a hierarchical data structure comprising a root 
node of said sub-tree and said selected set of descendant nodes. 






27. A method of presenting data substantially as described herein with reference to 
any one of the embodiments of the invention as that embodiment is illustrated in the 
drawings. 

28. Computer apparatus adapted to perform the method of any one of claims 1 to 27. 

29. A computer readable medium, having a program recorded thereon, where the 
program is configured to make a computer execute a procedure to present data, said 
program incorporating code adapted to perform the method of any one of claims 1 to 27. 
27. A computer readable medium, having a program recorded thereon, where the
program is configured to make a computer execute a procedure to construct at least one
hierarchical data structure from at least one hierarchical data source, said program
comprising:

(i) code for constructing a representation of said least one hierarchical data 
source and at least one previous view of said least one hierarchical data source; 

(ii) code for identifying at least one compulsory entity in said representation;
and

(iii) code for constructing said at least one hierarchical data structure 
comprising said least one compulsory entity and one or more context entities, where said 
context entities are obtained from said representation and context data obtained from said 
least one previous view. 


28. A computer readable medium according to claim 27 wherein said representation 
comprises a graphical representation. 

29. A computer readable medium according to claim 28 wherein said graphical
representation comprises a schema representation of said least one hierarchical data source
and said least one previous view.

30. A computer readable medium according to claim 27 wherein said context data
comprises data evaluated to represent a measure of relevance of said context entities to
said compulsory entity.

31. A computer readable medium according to claim 30 wherein said context data 
comprises at least one numerical data. 


32. A computer readable medium according to claim 31 wherein said least one 
associated numerical data comprises occurrence and joint-occurrence frequencies of 
entities in said representation observed in said least one previous view. 
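
As an illustration only, occurrence and joint-occurrence frequencies of the kind recited above might be gathered as in the following Python sketch. It assumes each previous view is available as a collection of node names and restricts joint occurrence to pairs; both the data layout and the function name are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Tuple


def count_frequencies(previous_views: Iterable[Iterable[str]]) -> Tuple[Counter, Counter]:
    """Count how often each node, and each pair of nodes, appears in previous views."""
    occurrence: Counter = Counter()
    joint: Counter = Counter()
    for view in previous_views:
        # Sort so that each unordered pair is always keyed the same way.
        nodes = sorted(set(view))
        occurrence.update(nodes)
        joint.update(combinations(nodes, 2))
    return occurrence, joint


views = [{"book", "title"}, {"book", "title", "author"}, {"book", "author"}]
occurrence, joint = count_frequencies(views)
print(occurrence["book"], joint[("book", "title")])
```

Larger n-tuples could be counted the same way by raising the second argument of `combinations`, at a correspondingly higher storage cost.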

33. A computer readable medium according to claim 28 wherein the root node of said
least one hierarchical data structure is an ancestor node of said least one compulsory entity 
in said representation. 

34. A computer readable medium according to claim 27 wherein said least one
hierarchical data structure is assigned a score equal to the occurrence probability of said
root node given the occurrence of each of said least one compulsory entity.
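
One possible way to approximate such a score, offered purely as a sketch, is to take the fraction of previous views containing every compulsory entity that also contain the candidate root node. The view layout (sets of node names), the function name and the zero score returned when no view matches are assumptions made for the example.

```python
from typing import Iterable, Sequence


def root_score(root: str, compulsory: Sequence[str],
               previous_views: Iterable[Iterable[str]]) -> float:
    """Estimate P(root | all compulsory entities) from previous views."""
    required = set(compulsory)
    # Keep only the views in which every compulsory entity occurs.
    matching = [set(view) for view in previous_views if required <= set(view)]
    if not matching:
        return 0.0  # assumed fallback when nothing has been observed
    return sum(1 for view in matching if root in view) / len(matching)


views = [{"library", "book", "title"}, {"book", "title"}, {"library", "book", "author"}]
print(root_score("library", ["title"], views))
print(root_score("book", ["title"], views))
```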

35. A computer readable medium according to claim 34 wherein said one or more 
context entities is selected from the group consisting of: 

(a) said root node;

(b) a first set of nodes along at least one directed path in said representation 
from said root node to said least one compulsory entity; 

(c) a second set of nodes selected from descendant nodes of said root node in
said representation, each said node in said second set being selected based upon a
corresponding occurrence probability, each said occurrence probability being derived from
the occurrence of ancestors of said node up to and including said root node and said least
one compulsory entity;

(d) a third set of nodes selected from descendant nodes of said root node in
said representation based on a corresponding distance of said third set node from said root
node in said representation; and

(e) a fourth set of nodes selected from descendant nodes of said root node in 
said representation based on a corresponding distance of said fourth set node from said 
least one compulsory entity in said representation. 


36. A computer readable medium according to claim 35 wherein said second set of 
nodes comprises zero or more child nodes of at least one parent node in said representation 
lying along a directed path from said root node to said least one compulsory entity. 

37. A computer readable medium according to claim 35 wherein said corresponding
distances comprise a number of links separating the subject nodes in said representation. 
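
For illustration, a distance measured as a number of links might be computed with a breadth-first search, as in the Python sketch below. The representation is assumed to be a mapping from each parent node to its child nodes, and links are treated as traversable in both directions so that the distance between a descendant and a compulsory entity in another branch is defined; both points are assumptions made for the example.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def link_distance(children: Dict[str, List[str]], start: str, goal: str) -> Optional[int]:
    """Return the number of links between two nodes, or None if unconnected."""
    # Build an undirected adjacency map from the parent -> children mapping.
    adjacency: Dict[str, Set[str]] = {}
    for parent, kids in children.items():
        for kid in kids:
            adjacency.setdefault(parent, set()).add(kid)
            adjacency.setdefault(kid, set()).add(parent)

    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


tree = {"book": ["title", "chapter"], "chapter": ["heading", "para"]}
print(link_distance(tree, "book", "para"))
print(link_distance(tree, "title", "para"))
```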

38. A computer readable medium according to claim 35 wherein, for code for
constructing said hierarchical data, said zero or more child nodes are selected as context
nodes from all child nodes of said least one parent node, said selection being performed
by:

(iii-a) code for computing a first occurrence probability of said parent node
appearing with none of its child nodes other than a fifth set of nodes, given the occurrence
of said parent node, ancestors of said parent node up to and including said root node and
said least one compulsory entity, said fifth set comprising child nodes of said parent node
lying along a directed path from said parent node to said least one compulsory entity;

(iii-b) code for computing a second occurrence probability of each child node in a
sixth set of nodes, given the occurrence of said parent node, ancestors of said parent node
up to and including said root node and said least one compulsory entity, said sixth set
comprising child nodes of said parent node that do not lie along a directed path from said
parent node to said least one compulsory entity;

(iii-c) code for computing a total sum of said first occurrence probability and said
second occurrence probabilities;

(iii-d) code for creating a fictitious node and assigning said fictitious node said
first occurrence probability;

(iii-e) code for selecting said fifth set of child nodes as context nodes;

(iii-f) code for selecting as context nodes a seventh set of child nodes formed
from said sixth set of child nodes and said fictitious node arranged in order of descending
values of said first occurrence probability or said second occurrence probability, and for
which the sum of said first occurrence probability or said second occurrence probabilities
of said seventh set of child nodes equals or exceeds half of said total sum; and

(iii-g) code for deselecting as a context node said fictitious node if said fictitious
node is selected in said seventh set of child nodes.


38A. A computer readable medium according to claim 38 wherein said fictitious node
prevents other nodes, whose associated probabilities are less than the probability
associated with the fictitious node, from being selected, since nodes are selected as context
nodes until their sum exceeds half of the total sum.
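
The blocking effect of the fictitious node can be seen in a small Python sketch of the selection recited in claim 38, given here as a non-limiting example. It assumes the probabilities are precomputed; `on_path` stands for the fifth set, `p_off_path` for the probabilities of the sixth set, and all names are hypothetical.

```python
from typing import Dict, List

FICTITIOUS = "__parent_alone__"  # hypothetical label for the fictitious node


def select_with_path_children(p_alone: float, p_off_path: Dict[str, float],
                              on_path: List[str]) -> List[str]:
    """Always keep on-path children, then choose among off-path children."""
    # Fifth set: children on a path to a compulsory entity are always selected.
    selected = list(on_path)

    # Sixth set plus the fictitious node compete for the remaining places.
    candidates = dict(p_off_path)
    candidates[FICTITIOUS] = p_alone
    total = sum(candidates.values())

    running = 0.0
    for name, prob in sorted(candidates.items(), key=lambda kv: kv[1], reverse=True):
        if running >= total / 2:
            break
        running += prob
        # The fictitious node consumes probability mass but is never emitted.
        if name != FICTITIOUS:
            selected.append(name)
    return selected


print(select_with_path_children(0.3, {"title": 0.5, "index": 0.1}, ["chapter"]))
print(select_with_path_children(0.6, {"title": 0.3, "index": 0.1}, ["chapter"]))
```

In the second call the fictitious node ranks first and reaches half of the total on its own, so neither off-path child is selected and only the on-path child remains.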

39. A computer readable medium according to claim 35 wherein, for said code for
constructing said hierarchical data, said zero or more child nodes are selected as context
nodes from all child nodes of said least one parent node, said selection being performed
by:

(iii-a) code for computing a first occurrence probability of said parent
node appearing with none of its child nodes other than a fifth set of nodes, given the
occurrence of said parent node, ancestors of said parent node up to and including said root
node and said least one compulsory entity, said fifth set comprising child nodes of said
parent node lying along a directed path from said parent node to said least one compulsory
entity;

(iii-b) code for selecting said fifth set of child nodes as context nodes; and
if said first occurrence probability is less than or equal to 0.5:

(iii-c) code for computing a second occurrence probability of each child
node in a sixth set of nodes, given the occurrence of said parent node, ancestors of said
parent node up to and including said root node and said least one compulsory entity, said
sixth set comprising child nodes of said parent node that do not lie along a directed path
from said parent node to said least one compulsory entity;

(iii-d) code for computing a total sum of said second occurrence
probabilities of said second set of child nodes;

(iii-e) code for selecting as context nodes a seventh set of child nodes
formed from said sixth set of child nodes in order of descending values of said second
occurrence probability until the sum of said second occurrence probabilities of said
seventh set of child nodes equals or exceeds half of said total sum.

40. A computer readable medium according to claim 34 wherein said second set of
nodes comprises zero or more child nodes of at least one parent node in said representation 
not lying along a directed path from said root node to said least one compulsory entity. 


41. A computer readable medium according to claim 40 wherein, for said code for
constructing said hierarchical data, said zero or more child nodes are selected from all
child nodes of said least one parent node, said selection being performed by:

(iii-a) code for computing a first occurrence probability of said parent
node appearing without any of its child nodes given the occurrence of said parent node,
ancestors of said parent node up to and including said root node and said least one
compulsory entity;

(iii-b) code for computing a second occurrence probability of each child
node of said parent node given the occurrence of said parent node, ancestors of said parent
node up to and including said root node and said least one compulsory entity;

(iii-c) code for computing a total sum of said first occurrence probability
and said second occurrence probabilities of all child nodes of said parent node;

(iii-d) code for creating a fictitious node and assigning said fictitious node
said first occurrence probability;

(iii-e) code for selecting as context nodes those nodes from a set of said
fictitious node and all child nodes of said parent node arranged in order of descending
values of said first occurrence probability or said second occurrence probabilities until the
sum of said first occurrence probability or said second occurrence probability of selected
nodes equals or exceeds half of said total sum; and

(iii-f) code for deselecting said fictitious node as a context node if said
fictitious node is among said selected nodes.

42. A computer readable medium according to claim 39, 40, or 41 wherein said first
occurrence probability and said second occurrence probability are approximated using at
least one occurrence frequency of a node in said representation, co-occurrence frequency
between a pair of nodes in said representation, and joint-occurrence frequency between an
n-tuple of nodes in said representation observed in said least one previous view.

43. A computer readable medium according to claim 27 wherein said compulsory
entity represents a location of one or more search keywords. 

44. A computer readable medium according to claim 27 wherein said compulsory 
entity represents a user-selected entity. 


45. A computer readable medium according to claim 44 wherein said first occurrence
probability and said second occurrence probability are approximated using at least one
occurrence frequency of a node in said representation, co-occurrence frequency between a
pair of nodes in said representation, and joint-occurrence frequency between an n-tuple of
nodes in said representation observed in said least one previous view.

46. A computer readable medium, having a program recorded thereon, where the 
program is configured to make a computer execute a procedure to select data from a 
hierarchically-structured data source, said program comprising:

(i) code for forming a graphical representation of said hierarchically-structured
data source;

(ii) code for detecting a user selection of part of said representation; 

(iii) code for selecting a set of descendant hierarchical components in said user- 
selected part based on an occurrence probability of said set of hierarchical components 
given a root component of said user-selected part. 

47. A computer readable medium according to claim 46 wherein said graphical 
representation comprises a tree representation wherein said hierarchical components 
comprise nodes of said tree and said user-selected part comprises a sub-tree of said tree 
representation. 

48. A computer readable medium, having a program recorded thereon, where the 
program is configured to make a computer execute a procedure to construct and present 
data for a keyword searching operation in at least one hierarchical data source involving at 
least one search keyword, said program comprising: 

(i) code for constructing a graphical representation of said least one 
hierarchical data source and at least one previous view of said least one hierarchical data 
source; 

(ii) code for identifying at least one compulsory entity in said graphical 
representation, where said compulsory entity is a node in said graphical representation 
representing a location of one or more said least one search keyword; 

(iii) code for constructing at least one hierarchical data structure comprising
said least one compulsory entity and one or more context entities, where said context
entities are obtained from said graphical representation and context data obtained from
said least one previous view; and

(iv) code for presenting said least one hierarchical data structure as result of
said keyword searching operation.


49. A computer readable medium, having a program recorded thereon, where the 
program is configured to make a computer execute a procedure to present data sourced 
from a sub-tree of a hierarchically-presented data, said program comprising: 

(i) code for selecting a set of descendant nodes in said sub-tree based on
context data obtained from at least one previous presentation of said
hierarchically-presented data; and

(ii) code for constructing and presenting a hierarchical data structure 
comprising a root node of said sub-tree and said selected set of descendant nodes. 

50. Computer apparatus for constructing at least one hierarchical data structure from
at least one hierarchical data source, said apparatus comprising:

a first constructing module configured to construct a representation of said least
one hierarchical data source and at least one previous view of said least one hierarchical
data source;

an identifying module configured to identify at least one compulsory entity in said
representation; and

a second constructing module configured to construct said at least one hierarchical
data structure comprising said least one compulsory entity and one or more context
entities, where said context entities are obtained from said representation and context data
obtained from said least one previous view.

51. Computer apparatus for selecting data from a hierarchically-structured data
source, said apparatus comprising:

a forming module configured to form a graphical representation of said
hierarchically-structured data source;

a detecting module configured to detect a user selection of part of said
representation;

a selecting module configured to select a set of descendant hierarchical
components in said user-selected part based on an occurrence probability of said set of
hierarchical components given a root component of said user-selected part.

52. Computer apparatus according to claim 51 wherein said graphical representation
comprises a tree representation wherein said hierarchical components comprise nodes of
said tree and said user-selected part comprises a sub-tree of said tree representation.

53. Computer apparatus for construction and presentation of data for a keyword
searching operation in at least one hierarchical data source involving at least one search
keyword, said apparatus comprising:

a constructing module adapted to construct a graphical representation of said least
one hierarchical data source and at least one previous view of said least one hierarchical
data source;

an identifying module adapted to identify at least one compulsory entity in said
graphical representation, where said compulsory entity is a node in said graphical
representation representing a location of one or more said least one search keyword;

a constructing module adapted to construct at least one hierarchical data
structure comprising said least one compulsory entity and one or more context entities,
where said context entities are obtained from said graphical representation and context
data obtained from said least one previous view; and

a presenting module adapted to present said least one hierarchical data structure
as result of said keyword searching operation.

54. Computer apparatus for presentation of data sourced from a sub-tree of a
hierarchically-presented data, said apparatus comprising:

a selecting module adapted to select a set of descendant nodes in said sub-tree 
based on context data obtained from at least one previous presentation of said 
hierarchically-presented data; and 

a constructing module adapted to construct and present a hierarchical data
structure comprising a root node of said sub-tree and said selected set of descendant nodes.

DATED this TWENTY-EIGHTH Day of NOVEMBER 2003
CANON KABUSHIKI KAISHA
Patent Attorneys for the Applicant
SPRUSON & FERGUSON

[Drawings: flowchart fragments; only the following box text is recoverable]
Invoke method 13000 to select among child nodes of Pj as context nodes.
Invoke method 36000 at step 38020 for each child node Ck selected as context node, but with Ck playing the role of Pj.
[...] select among child nodes of Pj as context nodes.
38050
Invoke method 38000 at step 38015 for each child [...]